Credit-Score-Analysis-and-Prediction¶

A data science and machine learning project for analyzing financial behavior and predicting credit default risk and credit scores using statistical inference, dimensionality reduction, clustering, and supervised learning models.


📋 Project Overview¶

This project presents a full end-to-end credit risk analysis pipeline, combining classical statistics and modern machine learning techniques to model customer financial behavior and predict credit outcomes.


Libraries and Packages¶

In [114]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import prince

from xgboost import XGBClassifier, XGBRegressor
from sklearn.preprocessing import PowerTransformer, StandardScaler, RobustScaler
from sklearn.cross_decomposition import CCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor 
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from imblearn.over_sampling import SMOTE

from scipy.stats import(
    norm
    , t
    , chi2
    , chi2_contingency
    , kstest
    , skew
)

#File path image
from pathlib import Path
figure_dir = Path("figure")
figure_dir.mkdir(exist_ok=True)

1. About Dataset¶

The dataset used in this study is publicly available on Kaggle and is called “Credit Score”. It contains with 1000 rows and 84 features comprehensive financial and behavioral information on a large number of customers. This dataset includes a wide range of variables, providing a multidimensional view of an individual’s financial situation, spending behavior, and potential credit risk. By combining raw financial measures, behavioral ratios, and categorical attributes, this dataset enables the development of robust and interpretable predictive models for analyzing credit risk and default.

  • CUST_ID: Unique customer identifier

Key Target Variables:

  • CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer)
  • DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0)

Description of Features:

  • INCOME: Total income in the last 12 months
  • SAVINGS: Total savings in the last 12 months
  • DEBT: Total existing debt
  • R_SAVINGS_INCOME: Ratio of savings to income
  • R_DEBT_INCOME: Ratio of debt to income
  • R_DEBT_SAVINGS: Ratio of debt to savings
  • Transaction groups (GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT,GAMBLING, UTILITIES, TAX, FINES) are categorized.

  • T_{GROUP}_6: Total expenditure in that group in the last 6 months

  • T_GROUP_12: Total expenditure in that group in the last 12 months
  • R[GROUP]: Ratio of T[GROUP]6 to T[GROUP]_12
  • R_[GROUP]INCOME: Ratio of T[GROUP]_12 to INCOME
  • R_[GROUP]SAVINGS: Ratio of T[GROUP]_12 to SAVINGS
  • R_[GROUP]DEBT: Ratio of T[GROUP]_12 to DEBT

    Categorical Features:

  • CAT_GAMBLING: Gambling category (none, low, high)

  • CAT_DEBT: 1 if the customer has debt; 0 otherwise
  • CAT_CREDIT_CARD: 1 if the customer has a credit card; 0 otherwise
  • CAT_MORTGAGE: 1 if the customer has a mortgage; 0 otherwise
  • CAT_SAVINGS_ACCOUNT: 1 if the customer has a savings account; 0 otherwise
  • CAT_DEPENDENTS: 1 if the customer has any dependents; 0 otherwise

Summary :

  • Customer ID (string)
  • Financial numeric features (Income, Savings, Debt,…)
  • Ratios
  • Transaction amounts over 6 and 12 months
  • Binary categorical indicators (one-hot encoded) Two target variables:
  • CREDIT_SCORE (regression)
  • DEFAULT (classification)

2. Loading Dataset¶

In [115]:
df = pd.read_csv(r"./datasets/credit_score.csv")
df
Out[115]:
CUST_ID INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 C02COQEVYU 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 ... 0.0000 0.0625 High 1 0 0 0 0 444 1
1 C02OZKC0ZF 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 ... 0.7692 0.2222 No 1 0 0 1 0 625 0
2 C03FHP2D0A 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 ... 1.4286 0.0578 High 1 0 0 1 0 469 1
3 C03PVPPHOY 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 ... 1.2500 0.1282 High 1 0 0 1 0 559 0
4 C04J69MUX0 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 ... 0.1163 0.0568 High 1 1 1 1 1 473 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 CZQHJC9HDH 328892 1465066 5501471 4.4546 16.7273 3.7551 16701 10132 0.6067 ... 0.2041 0.0543 High 1 1 1 1 1 418 0
996 CZRA4MLB0P 81404 88805 680837 1.0909 8.3637 7.6667 5400 1936 0.3585 ... 0.8333 0.1087 No 1 0 0 1 0 589 1
997 CZSOD1KVFX 0 42428 30760 3.2379 8.1889 0.7250 0 0 0.8779 ... 0.2500 0.3448 No 1 0 0 1 0 499 0
998 CZWC76UAUT 36011 8002 604181 0.2222 16.7777 75.5037 1993 1271 0.6377 ... 5.0002 0.0662 No 1 1 0 1 0 507 0
999 CZZV5B3SAL 44266 309859 44266 6.9999 1.0000 0.1429 1574 1264 0.8030 ... 0.1587 1.1111 No 1 0 0 1 0 657 0

1000 rows × 87 columns

In [116]:
df.columns[1:5]
Out[116]:
Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME'], dtype='object')
In [117]:
print("Target columns in the dataset: ")
df[['DEFAULT', 'CREDIT_SCORE']]
Target columns in the dataset: 
Out[117]:
DEFAULT CREDIT_SCORE
0 1 444
1 0 625
2 1 469
3 0 559
4 0 473
... ... ...
995 0 418
996 1 589
997 0 499
998 0 507
999 0 657

1000 rows × 2 columns

In [118]:
print("Dataset infomation : ")
df.info()
Dataset infomation : 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 87 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CUST_ID                  1000 non-null   object 
 1   INCOME                   1000 non-null   int64  
 2   SAVINGS                  1000 non-null   int64  
 3   DEBT                     1000 non-null   int64  
 4   R_SAVINGS_INCOME         1000 non-null   float64
 5   R_DEBT_INCOME            1000 non-null   float64
 6   R_DEBT_SAVINGS           1000 non-null   float64
 7   T_CLOTHING_12            1000 non-null   int64  
 8   T_CLOTHING_6             1000 non-null   int64  
 9   R_CLOTHING               1000 non-null   float64
 10  R_CLOTHING_INCOME        1000 non-null   float64
 11  R_CLOTHING_SAVINGS       1000 non-null   float64
 12  R_CLOTHING_DEBT          1000 non-null   float64
 13  T_EDUCATION_12           1000 non-null   int64  
 14  T_EDUCATION_6            1000 non-null   int64  
 15  R_EDUCATION              1000 non-null   float64
 16  R_EDUCATION_INCOME       1000 non-null   float64
 17  R_EDUCATION_SAVINGS      1000 non-null   float64
 18  R_EDUCATION_DEBT         1000 non-null   float64
 19  T_ENTERTAINMENT_12       1000 non-null   int64  
 20  T_ENTERTAINMENT_6        1000 non-null   int64  
 21  R_ENTERTAINMENT          1000 non-null   float64
 22  R_ENTERTAINMENT_INCOME   1000 non-null   float64
 23  R_ENTERTAINMENT_SAVINGS  1000 non-null   float64
 24  R_ENTERTAINMENT_DEBT     1000 non-null   float64
 25  T_FINES_12               1000 non-null   int64  
 26  T_FINES_6                1000 non-null   int64  
 27  R_FINES                  1000 non-null   float64
 28  R_FINES_INCOME           1000 non-null   float64
 29  R_FINES_SAVINGS          1000 non-null   float64
 30  R_FINES_DEBT             1000 non-null   float64
 31  T_GAMBLING_12            1000 non-null   int64  
 32  T_GAMBLING_6             1000 non-null   int64  
 33  R_GAMBLING               1000 non-null   float64
 34  R_GAMBLING_INCOME        1000 non-null   float64
 35  R_GAMBLING_SAVINGS       1000 non-null   float64
 36  R_GAMBLING_DEBT          1000 non-null   float64
 37  T_GROCERIES_12           1000 non-null   int64  
 38  T_GROCERIES_6            1000 non-null   int64  
 39  R_GROCERIES              1000 non-null   float64
 40  R_GROCERIES_INCOME       1000 non-null   float64
 41  R_GROCERIES_SAVINGS      1000 non-null   float64
 42  R_GROCERIES_DEBT         1000 non-null   float64
 43  T_HEALTH_12              1000 non-null   int64  
 44  T_HEALTH_6               1000 non-null   int64  
 45  R_HEALTH                 1000 non-null   float64
 46  R_HEALTH_INCOME          1000 non-null   float64
 47  R_HEALTH_SAVINGS         1000 non-null   float64
 48  R_HEALTH_DEBT            1000 non-null   float64
 49  T_HOUSING_12             1000 non-null   int64  
 50  T_HOUSING_6              1000 non-null   int64  
 51  R_HOUSING                1000 non-null   float64
 52  R_HOUSING_INCOME         1000 non-null   float64
 53  R_HOUSING_SAVINGS        1000 non-null   float64
 54  R_HOUSING_DEBT           1000 non-null   float64
 55  T_TAX_12                 1000 non-null   int64  
 56  T_TAX_6                  1000 non-null   int64  
 57  R_TAX                    1000 non-null   float64
 58  R_TAX_INCOME             1000 non-null   float64
 59  R_TAX_SAVINGS            1000 non-null   float64
 60  R_TAX_DEBT               1000 non-null   float64
 61  T_TRAVEL_12              1000 non-null   int64  
 62  T_TRAVEL_6               1000 non-null   int64  
 63  R_TRAVEL                 1000 non-null   float64
 64  R_TRAVEL_INCOME          1000 non-null   float64
 65  R_TRAVEL_SAVINGS         1000 non-null   float64
 66  R_TRAVEL_DEBT            1000 non-null   float64
 67  T_UTILITIES_12           1000 non-null   int64  
 68  T_UTILITIES_6            1000 non-null   int64  
 69  R_UTILITIES              1000 non-null   float64
 70  R_UTILITIES_INCOME       1000 non-null   float64
 71  R_UTILITIES_SAVINGS      1000 non-null   float64
 72  R_UTILITIES_DEBT         1000 non-null   float64
 73  T_EXPENDITURE_12         1000 non-null   int64  
 74  T_EXPENDITURE_6          1000 non-null   int64  
 75  R_EXPENDITURE            1000 non-null   float64
 76  R_EXPENDITURE_INCOME     1000 non-null   float64
 77  R_EXPENDITURE_SAVINGS    1000 non-null   float64
 78  R_EXPENDITURE_DEBT       1000 non-null   float64
 79  CAT_GAMBLING             1000 non-null   object 
 80  CAT_DEBT                 1000 non-null   int64  
 81  CAT_CREDIT_CARD          1000 non-null   int64  
 82  CAT_MORTGAGE             1000 non-null   int64  
 83  CAT_SAVINGS_ACCOUNT      1000 non-null   int64  
 84  CAT_DEPENDENTS           1000 non-null   int64  
 85  CREDIT_SCORE             1000 non-null   int64  
 86  DEFAULT                  1000 non-null   int64  
dtypes: float64(51), int64(34), object(2)
memory usage: 679.8+ KB
In [119]:
# Checking data types of each columns
print(df.dtypes)
CUST_ID                 object
INCOME                   int64
SAVINGS                  int64
DEBT                     int64
R_SAVINGS_INCOME       float64
                        ...   
CAT_MORTGAGE             int64
CAT_SAVINGS_ACCOUNT      int64
CAT_DEPENDENTS           int64
CREDIT_SCORE             int64
DEFAULT                  int64
Length: 87, dtype: object
In [120]:
# Count total number of each data types
print(df.dtypes.value_counts())
float64    51
int64      34
object      2
Name: count, dtype: int64
In [121]:
# Summary statistics 
print("Summary Statistics:")
df.describe()
Summary Statistics:
Out[121]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_INCOME R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
count 1000.000000 1.000000e+03 1.000000e+03 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 ... 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000
mean 121610.019000 4.131896e+05 7.907180e+05 4.063477 6.068449 5.867252 6822.401000 3466.320000 0.454848 0.055557 ... 0.943607 0.913340 0.605276 0.944000 0.236000 0.173000 0.993000 0.15000 586.712000 0.284000
std 113716.699591 4.429160e+05 9.817904e+05 3.968097 5.847878 16.788356 7486.225932 5118.942977 0.236036 0.037568 ... 0.168989 1.625278 1.299382 0.230037 0.424835 0.378437 0.083414 0.35725 63.413882 0.451162
min 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.003400 ... 0.666700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 300.000000 0.000000
25% 30450.250000 5.971975e+04 5.396675e+04 1.000000 1.454500 0.206200 1084.500000 319.500000 0.263950 0.029700 ... 0.833300 0.158700 0.100000 1.000000 0.000000 0.000000 1.000000 0.00000 554.750000 0.000000
50% 85090.000000 2.738505e+05 3.950955e+05 2.545450 4.911550 2.000000 4494.000000 1304.000000 0.468850 0.046800 ... 0.909100 0.327950 0.178600 1.000000 0.000000 0.000000 1.000000 0.00000 596.000000 0.000000
75% 181217.500000 6.222600e+05 1.193230e+06 6.307100 8.587475 4.509600 10148.500000 4555.500000 0.626300 0.069400 ... 1.000000 0.833300 0.588200 1.000000 0.000000 0.000000 1.000000 0.00000 630.000000 1.000000
max 662094.000000 2.911863e+06 5.968620e+06 16.111200 37.000600 292.842100 43255.000000 39918.000000 1.058300 0.251700 ... 2.000200 10.009900 10.005300 1.000000 1.000000 1.000000 1.000000 1.00000 800.000000 1.000000

8 rows × 85 columns

Select variable & maximum likelihood estimation (MLE)¶

In [122]:
X = df["CREDIT_SCORE"].dropna().astype(float).values
n = len(X)

## MLE Estimators
mu_mle      = X.mean()
sigma2_mle  = ((X-mu_mle)**2).mean()
sigma2_unbiased = np.var(X, ddof=1)

print("=== MLE Estimation for CREDIT_SCORE ===")
print(f"n = {n}")
print(f"MLE mean (μ̂_MLE): {mu_mle:.4f}")
print(f"MLE variance (σ̂²_MLE, biased): {sigma2_mle:.4f}")
print(f"Unbiased variance (s²): {sigma2_unbiased:.4f}")
=== MLE Estimation for CREDIT_SCORE ===
n = 1000
MLE mean (μ̂_MLE): 586.7120
MLE variance (σ̂²_MLE, biased): 4017.2991
Unbiased variance (s²): 4021.3204

Confidence interval for $\mu$¶

In [123]:
# signficant level
alpha = 0.05
s     = np.sqrt(sigma2_unbiased)

# CI for mu knowing sigma
z     = norm.ppf(1 - alpha/2)
ci_mu_known = (
    mu_mle - z * s / np.sqrt(n)
    , mu_mle + z * s / np.sqrt(n)
)

# CI for mu knowing sigma
t_val = t.ppf(1 - alpha/2, df=n-1)
ci_mu_unknown = (
    mu_mle - t_val * s / np.sqrt(n)
    , mu_mle + t_val * s / np.sqrt(n)
)

print("\n=== Confidence Intervals for μ ===")
print(f"95% CI (σ known):    {ci_mu_known}")
print(f"95% CI (σ unknown): {ci_mu_unknown}")
=== Confidence Intervals for μ ===
95% CI (σ known):    (np.float64(582.7816391220821), np.float64(590.6423608779179))
95% CI (σ unknown): (np.float64(582.7768715135624), np.float64(590.6471284864376))

Confidence Interval for $\sigma^{2}$ using $\chi^{2}$¶

In [124]:
chi2_low  = chi2.ppf(alpha/2, df=n-1)
chi2_high = chi2.ppf(1 - alpha/2, df=n-1)

ci_sigma2 = (
    (n-1) * sigma2_unbiased / chi2_high
    , (n-1) * sigma2_unbiased / chi2_low
)

print("\n=== Confidence Interval for σ² (Chi-square) ===")
print(f"95% CI for variance σ²: {ci_sigma2}")
print(f"Approx CI for std deviation σ: ({np.sqrt(ci_sigma2[0]):.3f}, {np.sqrt(ci_sigma2[1]):.3f})")
=== Confidence Interval for σ² (Chi-square) ===
95% CI for variance σ²: (np.float64(3690.7182226729915), np.float64(4398.658343162247))
Approx CI for std deviation σ: (60.751, 66.322)

Hypothesis Tests¶

Hpythesis test on mean¶

In [125]:
# Test H0: mu=600 for credit score
mu0 = 600

T_stat      = (mu_mle - mu0) / (s / np.sqrt(n))
p_val_mean  = 2 * (1 - t.cdf(abs(T_stat), df=n-1))

print("\n=== Hypothesis Test: Mean of CREDIT_SCORE ===")
print(f"H0: μ = {mu0}")
print(f"T statistic = {T_stat:.4f}")
print(f"p-value     = {p_val_mean:.4f}")
=== Hypothesis Test: Mean of CREDIT_SCORE ===
H0: μ = 600
T statistic = -6.6264
p-value     = 0.0000

This result shows that a one-sample t-test was performed to assess whether the mean CREDIT_SCORE equals 600. The result (T = −6.626, p < 0.001) indicates that the null hypothesis can be rejected. The population mean is significantly lower than 600.

Hypothesis test on variance¶

$$ H_{0}: \sigma^{2} = 60^{2} $$
In [126]:
#  Variance test 
sigma0      = 60**2  
V_stat      = (n-1) * sigma2_unbiased / sigma0
p_val_var   = 2 * min(chi2.cdf(V_stat, df=n-1),
                    1 - chi2.cdf(V_stat, df=n-1))

print("\n=== Hypothesis Test: Variance of CREDIT_SCORE ===")
print(f"H0: σ² = {sigma0}")
print(f"Chi-square statistic = {V_stat:.4f}")
print(f"p-value              = {p_val_var:.4f}")
=== Hypothesis Test: Variance of CREDIT_SCORE ===
H0: σ² = 3600
Chi-square statistic = 1115.9164
p-value              = 0.0113

Independent Test¶

In [127]:
## CAT_GAMBLING VS DEFAULT
contingency = pd.crosstab(df["CAT_GAMBLING"], df["DEFAULT"])
chi2_stat, p_ind, dof, exp = chi2_contingency(contingency)

print("\n=== Independence Test: CAT_GAMBLING vs DEFAULT ===")
print(f"Chi2 statistic = {chi2_stat:.4f}")
print(f"p-value        = {p_ind:.4f}")
print(f"Degrees of freedom = {dof}")
=== Independence Test: CAT_GAMBLING vs DEFAULT ===
Chi2 statistic = 2.6778
p-value        = 0.2621
Degrees of freedom = 2

Gambling behavior does not appear to influence whether a customer defaults or not, based on this dataset.

Normality Check¶

In [128]:
X_std       = (X - mu_mle) / s
ks_stat, ks_p = kstest(X_std, 'norm')

print("\n=== Kolmogorov–Smirnov Normality Test (CREDIT_SCORE) ===")
print(f"KS statistic = {ks_stat:.4f}")
print(f"p-value      = {ks_p:.4f}")

# QQ-plot
plt.figure(figsize=(6,6))
stats.probplot(X, dist="norm", plot=plt)
plt.title("QQ Plot for CREDIT_SCORE")
plt.show()
=== Kolmogorov–Smirnov Normality Test (CREDIT_SCORE) ===
KS statistic = 0.0659
p-value      = 0.0003

The Kolmogorov–Smirnov test (p = 0.0003) rejects the null hypothesis of normality for CREDIT_SCORE. The QQ-plot also shows systematic deviations from the theoretical Gaussian line, with heavy tails and curvature. Therefore, CREDIT_SCORE cannot be considered normally distributed. This is where we can transform the distribution using such as (log, Box-Cox, Yeo-Johnson).

3. Data Preprocessing¶

Data preprocessing is a crucial step to ensure the quality of the data and to produce the best possible results. We cleaned and prepared the dataset so that all features could be properly used for different data analysis techniques and machine learning models.

In [129]:
## fill up the nan values
df  = df.replace([np.inf, -np.inf], np.nan)
df  = df.fillna(0)

df  = df.drop(columns=["CUST_ID"])

The dataset also contained two categorical variables: CUST_ID, which was removed because it serves only as an identifier, and CAT_GAMBLING, which was encoded into numerical values by converting the categories "no", "low", and "high" into 0, 1, and 2 respectively.

We do not remove outlier here, because credit score datasets naturally contain extreme behaviors (high debt, zero savings). These are meaningful and should remain in the model. Also drop ID out since it is just an identifier.

3.1. Detecting Missing values and Duplicates¶

In [130]:
N_A = df.isna().sum().sort_values(ascending=False)
N_A, N_A.sum()
Out[130]:
(INCOME                 0
 SAVINGS                0
 DEBT                   0
 R_SAVINGS_INCOME       0
 R_DEBT_INCOME          0
                       ..
 CAT_MORTGAGE           0
 CAT_SAVINGS_ACCOUNT    0
 CAT_DEPENDENTS         0
 CREDIT_SCORE           0
 DEFAULT                0
 Length: 86, dtype: int64,
 np.int64(0))
In [131]:
# duplicate
duplicate_rows = df[df.duplicated(keep=False)]

duplicate_row  = df.duplicated().sum()
print("Sum total duplicates = ", duplicate_row)
Sum total duplicates =  0

After checking we can observe that the dataset doesn't contain any missing values and duplicate values.

3.2. Numerical conversion¶

CAT_GAMBLING columns was encoded into numerical values by converting the categories "no", "low", and "high" into 0, 1, and 2 respectively.

In [132]:
cat_cols       = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", cat_cols)
Categorical columns: ['CAT_GAMBLING']
In [133]:
df["CAT_GAMBLING"].value_counts()
Out[133]:
CAT_GAMBLING
No      620
High    264
Low     116
Name: count, dtype: int64
In [134]:
gambling            = {'High':2,'Low':1,'No':0}
df["CAT_GAMBLING"]  = df["CAT_GAMBLING"].map(gambling)
df.head()
Out[134]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 0.0568 ... 0.0000 0.0625 2 1 0 0 0 0 444 1
1 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 0.0754 ... 0.7692 0.2222 0 1 0 0 1 0 625 0
2 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 0.0374 ... 1.4286 0.0578 2 1 0 0 1 0 469 1
3 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 0.0850 ... 1.2500 0.1282 2 1 0 0 1 0 559 0
4 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 0.0132 ... 0.1163 0.0568 2 1 1 1 1 1 473 0

5 rows × 86 columns

3.3. Variance Thresholding¶

In [135]:
variances = df.var().sort_values()

print(f"Variances = {variances.values}")
plt.figure(figsize=(10,4))
plt.plot(variances.values)
plt.title("Variance of Features")
plt.ylabel("Variance")
plt.show()
Variances = [1.33032933e-08 6.13660050e-07 1.93307207e-06 3.45459504e-06
 3.64886558e-06 1.12808807e-05 2.97442591e-04 3.64860707e-04
 4.53116219e-04 4.93765029e-04 6.66274890e-04 1.07865003e-03
 1.36812647e-03 1.41138794e-03 1.96084766e-03 2.04296282e-03
 2.73081684e-03 2.78940389e-03 3.27219146e-03 3.55618131e-03
 3.91643733e-03 4.02426458e-03 5.48148731e-03 6.35846090e-03
 6.46103765e-03 6.95795796e-03 7.13498630e-03 7.90829318e-03
 9.54764292e-03 1.22973155e-02 1.87081946e-02 1.98485994e-02
 2.00215421e-02 2.85571529e-02 3.21955423e-02 3.87907156e-02
 4.51129519e-02 4.93700844e-02 5.15900335e-02 5.29169169e-02
 5.57131097e-02 5.58459958e-02 5.62393284e-02 5.87347245e-02
 8.18204286e-02 1.27627628e-01 1.43214214e-01 1.80484484e-01
 2.03547548e-01 2.44957755e-01 3.33654442e-01 6.20261792e-01
 7.58022022e-01 1.68839248e+00 2.64152938e+00 1.57457917e+01
 3.41976728e+01 2.81848902e+02 4.02132038e+03 7.83387427e+03
 1.85427467e+04 5.54892274e+06 6.60735188e+06 1.00664006e+07
 1.14652552e+07 1.26127311e+07 2.15555279e+07 2.50716269e+07
 2.62035772e+07 2.82683687e+07 3.98629189e+07 4.99241214e+07
 5.43827157e+07 5.60435787e+07 8.19987580e+07 1.06364134e+08
 1.53467194e+08 3.20138940e+08 3.68920740e+08 4.97543175e+08
 1.28321569e+09 2.48541526e+09 7.96559696e+09 1.29314878e+10
 1.96174616e+11 9.63912373e+11]

3.4. Outlier detection¶

To better check the anomaly, it is great to see how the data distributed within the boxplot visualization.

In [136]:
num_cols    = df.select_dtypes(include=['int64', 'float64']).columns

cols        = num_cols[:8]

fig, axes   = plt.subplots(2, 4, figsize=(20, 10))

for idx, col in enumerate(cols):
    row     = idx // 4    
    col_pos = idx % 4  

    axes[row, col_pos].boxplot(df[col], vert=True, patch_artist=True,
                            boxprops=dict(facecolor='orange'))
    axes[row, col_pos].set_title(f"Box Plot for {col}")
    axes[row, col_pos].set_ylabel(col)

plt.tight_layout()
plt.show()

In this situation, We decide not to remove the outliers because they are the customers that we want to detect.

For example, people with extremely high debt, people with zero saving. These outliers are high-risk cases.

We use common detect method which is Interquatile Range (IQR) rule.

For each numerical variable $X$

$$ Q_{1} = \text{25-th percentile} \qquad Q_{3}=\text{75-th percentile} $$

And obersavtion $x_{i}$ is classified as an outlier if $$ x_{i} < Q_{1} - 1.25IQR \qquad \text{or} \qquad x_{i} > Q_{3} + 1.5IQR$$

In [137]:
iqr_columns = [
    "INCOME", "SAVINGS", "DEBT",
    "R_SAVINGS_INCOME", "R_DEBT_INCOME",
    "CREDIT_SCORE"
]

for col in iqr_columns:
    Q1  = df[col].quantile(0.25)
    Q3  = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    df[col] = df[col].clip(lower, upper)
In [138]:
# Reset index after dropping outliers to avoid indexing issues
df = df.reset_index(drop=True)
print(f"Index reset. DataFrame shape: {df.shape}")
Index reset. DataFrame shape: (1000, 86)
In [139]:
fig, axes = plt.subplots(2, 4, figsize=(20, 10))

for idx, col in enumerate(iqr_columns):
    r = idx // 4
    c = idx % 4

    axes[r, c].boxplot(df[col], patch_artist=True,
                    boxprops=dict(facecolor='orange'))
    axes[r, c].set_title(f"{col} (After IQR Winsorization)")

plt.tight_layout()
plt.show()

We applied IQR-based outlier handling only to selected features (INCOME, SAVINGS, DEBT, CREDIT_SCORE, and moderate ratios). These variables have distributions where IQR is meaningful. We did not apply IQR to long-tailed or categorical variables to avoid removing informative extreme financial behaviors.

3.5. Histogram distribution¶

In [140]:
plots_per_page  = 12
cols_per_row    = 4
rows_per_page   = 3

for i in range(0, len(num_cols), plots_per_page):
    subset      = num_cols[i:i+plots_per_page]

    fig, axes   = plt.subplots(rows_per_page, cols_per_row, figsize=(20, 12))

    axes        = axes.flatten()

    for j, col in enumerate(subset):
        axes[j].hist(df[col], bins=30, color='skyblue', edgecolor='black')
        axes[j].set_title(col)

    # Hide unused axes
    for k in range(len(subset), plots_per_page):
        axes[k].set_visible(False)

    plt.tight_layout()
    plt.show()

3.6. Feature Scaling¶

Group columns to do log-transformation

In [141]:
# Group 1 — Heavy-tailed monetary variables
money_cols = ["INCOME", "SAVINGS", "DEBT"]

# Group 2 — All T_* expenditure features
t_cols = [col for col in df.columns if col.startswith("T_")]

# Group 3 — Extreme ratio variables (_SAVINGS, _DEBT)
extreme_ratio_cols = [
    col for col in df.columns
    if col.endswith("_SAVINGS") or col.endswith("_DEBT")
]

# Group 4 — Moderately skewed _INCOME ratios
income_ratio_cols = [
    col for col in df.columns
    if col.endswith("_INCOME")
]

# Group 5 — Balanced ratio variables (no transform expected)
balanced_ratio_cols = [
    col for col in df.columns
    if col.startswith("R_")
    and col not in extreme_ratio_cols
    and col not in income_ratio_cols
]

# Group 6 — Binary categorical variables
binary_cols = [col for col in df.columns if col.startswith("CAT_")]

# Group 7 — Target variables
target_cols = ["CREDIT_SCORE", "DEFAULT"]
In [142]:
def plot_group(columns, title, df, cols_per_row=4):
    if len(columns) == 0:
        return
    
    rows        = (len(columns) + cols_per_row - 1) // cols_per_row
    fig, axes   = plt.subplots(rows, cols_per_row, figsize=(20, rows * 4))
    axes        = axes.flatten()

    for i, col in enumerate(columns):
        axes[i].hist(df[col], bins=30, color='skyblue', edgecolor='black')
        axes[i].set_title(col)

    # Hide empty axes
    for j in range(len(columns), len(axes)):
        axes[j].set_visible(False)

    fig.suptitle(title, fontsize=20)
    plt.tight_layout()
    plt.show()
In [143]:
plot_group(money_cols, "Group 1 — Heavy-Tailed Monetary Variables\n", df)
plot_group(t_cols, "Group 2 — T_* Expenditure Variables\n", df)
plot_group(extreme_ratio_cols, "Group 3 — Extreme Ratio Variables (_SAVINGS & _DEBT)\n", df)
plot_group(income_ratio_cols, "Group 4 — Moderately Skewed Income Ratios\n", df)
plot_group(balanced_ratio_cols, "Group 5 — Balanced Ratio Variables\n", df)
plot_group(binary_cols, "Group 6 — Binary Variables\n", df)
plot_group(target_cols, "Group 7 — Target Variables\n", df)
In [144]:
# 1. Heavy monetary variables
must_transform = set(money_cols)

# 2. All T_* expenditure variables
must_transform.update(t_cols)

# 3. Extreme *_SAVINGS and *_DEBT ratios (exclude binary CAT_DEBT)
must_transform.update([c for c in extreme_ratio_cols if c != "CAT_DEBT"])

# 4. Strongly skewed income ratios (subset of income_ratio_cols)
strong_income_transform = [
    "R_SAVINGS_INCOME"
    , "R_DEBT_INCOME"
    , "R_EDUCATION_INCOME"
    , "R_ENTERTAINMENT_INCOME"
    , "R_GAMBLING_INCOME"
    , "R_HEALTH_INCOME"
    , "R_TAX_INCOME"
    , "R_TRAVEL_INCOME"
]

must_transform.update(strong_income_transform)

Does transform really help in PCA? => Yes, because in PCA, it already assumes that the features are roughly normal.

3.7. Log transformation¶

Skewness Correction.¶

To reduce skewness, we apply transformations that make the distribution more symmetric.

  • Log1p Transform $$ x' = log(1+x)$$

  • Yeo-Jonhson Transform Generalizes form of Box-Cox and workks for zero and negative values. It automaticlLY select a power parameter $\lambda$ to minimize skewness.

    • For $x \ge 0$: $$ T(x;\lambda) = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\[8pt] \ln(x+1), & \lambda = 0. \end{cases} $$

    • For $x < 0$: $$ T(x;\lambda) = \begin{cases} -\dfrac{(-x+1)^{\, 2-\lambda} - 1}{\, 2-\lambda}, & \lambda \ne 2, \\[8pt] -\ln(-x+1), & \lambda = 2. \end{cases} $$

Both transformations are applied only to variables where:

$$ | \text(skew)(X) | > 0.75 $$
In [145]:
df_original     = df.copy()

num_cols        = df.select_dtypes(include=['float64', 'int64']).columns
skip_cols       = ["CREDIT_SCORE", "DEFAULT"] + [col for col in df.columns if col.startswith("CAT_")]

skew_threshold = 0.75
transform_cols = [
    col for col in num_cols
    if col not in skip_cols and abs(skew(df[col])) > skew_threshold
]

df_log                  = df_original.copy()
df_log[transform_cols]  = np.log1p(df_original[transform_cols])

def compare_hist_log(col):
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    axes[0].hist(df_original[col], bins=40, color='skyblue', edgecolor='black')
    axes[0].set_title(f"{col} — BEFORE (Original)")

    axes[1].hist(df_log[col], bins=40, color='salmon', edgecolor='black')
    axes[1].set_title(f"{col} — AFTER log1p")

    plt.tight_layout()
    plt.show()

pt      = PowerTransformer(method="yeo-johnson")
df_yeo  = df_original.copy()
df_yeo[transform_cols] = pt.fit_transform(df_original[transform_cols])

def compare_hist_yeojohnson(col):
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    axes[0].hist(df_original[col], bins=40, color='skyblue', edgecolor='black')
    axes[0].set_title(f"{col} — BEFORE (Original)")

    axes[1].hist(df_yeo[col], bins=40, color='salmon', edgecolor='black')
    axes[1].set_title(f"{col} — AFTER (Yeo-Johnson)")

    plt.tight_layout()
    plt.show()

for col in transform_cols[:6]:
    compare_hist_log(col)
    compare_hist_yeojohnson(col)

df  = df_yeo.copy()

They all now has a better balancing distribution. We can come back again and compare the result between the original and transformed one.

3.8. Correlation Analysis¶

Do correlation analysis before scaling.

In [146]:
pearson_corr = df.corr(method='pearson')

plt.figure(figsize=(16, 12))
sns.heatmap(pearson_corr, cmap='coolwarm', center=0)
plt.title("Pearson Correlation Heatmap", fontsize=20)
plt.show()
In [147]:
# Check in 5 columns
sns.heatmap(pearson_corr.iloc[: 5, : 5], annot=True)
plt.figure(figsize =(3,2))
plt.show()
<Figure size 300x200 with 0 Axes>

Check high-correlated pairs

In [148]:
# We take 85% threshold
threshold = 0.85

high_corr_pairs = (
    pearson_corr
    .where(np.triu(np.ones(pearson_corr.shape), k=1).astype(bool))
    .stack()
    .reset_index()
)

high_corr_pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']

# Filter by threshold and remove old index numbers
high_corr_pairs = (
    high_corr_pairs[high_corr_pairs['Correlation'].abs() >= threshold]
    .sort_values(by="Correlation", ascending=False)
    .reset_index(drop=True)
)

print(high_corr_pairs[1:5])
        Feature_1      Feature_2  Correlation
1  T_EDUCATION_12  T_EDUCATION_6     0.999965
2    T_HOUSING_12    T_HOUSING_6     0.999961
3   T_GAMBLING_12   T_GAMBLING_6     0.999922
4        T_TAX_12        T_TAX_6     0.999163

4. Exploratory Data Analysis(EDA)¶

The Exploratory Data Analysis (EDA) in this project was conducted on the credit score dataset. The primary goal was to understand the distribution of the target variables CREDIT_SCORE and DEFAULT and identify key features that correlate most strongly with customer default risk.

4.1. Target Variable Analysis¶

Select and Count the number of numerical columns and categorical or objects columns¶

In [149]:
object_cols     = df.select_dtypes(include=['object']).columns.tolist()
number_object   = len(object_cols)

numeric_cols        = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
number_numerical    = len(numeric_cols)

print(f"\nThe total objects columns in dataset is: {object_cols} \n==> equal {number_object}")
print(f"The total numerical columns in dataset is: {numeric_cols} \n==> equal {number_numerical}")
The total objects columns in dataset is: [] 
==> equal 0
The total numerical columns in dataset is: ['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME', 'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING', 'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT', 'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME', 'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12', 'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME', 'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12', 'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS', 'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING', 'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT', 'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME', 'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6', 'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT', 'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME', 'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX', 'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12', 'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS', 'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES', 'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT', 'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE', 'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT', 'CAT_GAMBLING', 'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'] 
==> equal 86

Distribution and Statistic¶

In [150]:
# Distribution of CREDIT_SCORE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram with KDE
sns.histplot(df['CREDIT_SCORE'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of CREDIT_SCORE')
axes[0].axvline(df['CREDIT_SCORE'].mean(), color='red', linestyle='--', label=f"Mean: {df['CREDIT_SCORE'].mean():.1f}")
axes[0].axvline(df['CREDIT_SCORE'].median(), color='green', linestyle='--', label=f"Median: {df['CREDIT_SCORE'].median():.1f}")
axes[0].legend()

# Box plot
sns.boxplot(y=df['CREDIT_SCORE'], ax=axes[1], color='steelblue')
axes[1].set_title('Box Plot of CREDIT_SCORE')

# CREDIT_SCORE by DEFAULT
sns.boxplot(x='DEFAULT', y='CREDIT_SCORE', data=df, ax=axes[2], palette='Set2')
axes[2].set_title('CREDIT_SCORE by DEFAULT Status')
axes[2].set_xticklabels(['No Default (0)', 'Default (1)'])

plt.tight_layout()
plt.show()

# Statistics
print("CREDIT_SCORE Statistics:")
print(f"  Skewness: {df['CREDIT_SCORE'].skew():.3f}")
print(f"  Kurtosis: {df['CREDIT_SCORE'].kurtosis():.3f}")
print(f"\nCREDIT_SCORE by DEFAULT:")
print(df.groupby('DEFAULT')['CREDIT_SCORE'].describe())
df.groupby('DEFAULT')['CREDIT_SCORE'].describe()
/tmp/ipykernel_4647/946892219.py:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y='CREDIT_SCORE', data=df, ax=axes[2], palette='Set2')
/tmp/ipykernel_4647/946892219.py:18: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[2].set_xticklabels(['No Default (0)', 'Default (1)'])
CREDIT_SCORE Statistics:
  Skewness: -0.531
  Kurtosis: 0.121

CREDIT_SCORE by DEFAULT:
         count        mean        std      min     25%    50%     75%      max
DEFAULT                                                                       
0        716.0  600.312675  51.735251  441.875  566.00  606.5  635.25  742.875
1        284.0  558.238996  62.273254  441.875  516.75  565.5  609.00  691.000
Out[150]:
count mean std min 25% 50% 75% max
DEFAULT
0 716.0 600.312675 51.735251 441.875 566.00 606.5 635.25 742.875
1 284.0 558.238996 62.273254 441.875 516.75 565.5 609.00 691.000
  • Distribution of CREDIT_SCORE (Histogram): The histogram shows that the credit score distribution is slightly skewed to the left (negative skewness of -1.005), meaning there are a noticeable number of customers with higher scores clustered near the maximum range. The mean score is 586.7, and the median is 596.0. This difference confirms the slight negative skew, as the median is higher than the mean.
  • Box Plot of CREDIT_SCORE: The box plot visually confirms the distribution. The central box sits within the 500 to 700 range, and there are a few outlier points visible in the lower score region, representing customers with very low credit scores.
  • Box Plot CREDIT_SCORE by DEFAULT Status: This box plot compares the credit score distributions for customers who did not default (0) versus those who did (1). There is a distinct difference between the two groups. Customers who did not default (0) have a significantly higher median credit score and a tighter distribution, while customers who defaulted (1) have a lower median and a wider spread in scores. This clearly shows that CREDIT_SCORE is strongly inversely related to the risk of DEFAULT.

4.2. Credit Score Segmentation¶

The Credit Score was segmented into four categories (Poor, Fair, Good, Excellent) to analyze the default rate across risk levels.

  • Poor category shows the highest default rate.
  • Excellent category shows the lowest default rate.
Credit_Score Category
score < 500 Poor
500 <= score < 600 Fair
600 <= score < 700 Good
score >= 700 Excellent
In [151]:
# Credit Score Segmentation
def credit_score_category(score):
    if score < 500:
        return 'Poor'
    elif score < 600:
        return 'Fair'
    elif score < 700:
        return 'Good'
    else:
        return 'Excellent'

df['CREDIT_CATEGORY']   = df['CREDIT_SCORE'].apply(credit_score_category)

# Analysis by category
fig, axes               = plt.subplots(1, 2, figsize=(14, 5))

# Count by category
category_order  = ['Poor', 'Fair', 'Good', 'Excellent']
sns.countplot(x='CREDIT_CATEGORY', data=df, order=category_order, ax=axes[0], palette='RdYlGn')
axes[0].set_title('Distribution by Credit Score Category')
axes[0].set_xlabel('Credit Category')

# Default rate by category
default_rate = df.groupby('CREDIT_CATEGORY')['DEFAULT'].mean().reindex(category_order)
default_rate.plot(kind='bar', ax=axes[1], color=['red', 'orange', 'yellowgreen', 'green'])
axes[1].set_title('Default Rate by Credit Score Category')
axes[1].set_ylabel('Default Rate')
axes[1].set_xticklabels(category_order, rotation=0)

plt.tight_layout()
plt.show()

# Cross-tabulation
print("Cross-tabulation: CREDIT_CATEGORY vs DEFAULT")
print(pd.crosstab(df['CREDIT_CATEGORY'], df['DEFAULT'], margins=True, normalize='index').round(3))
/tmp/ipykernel_4647/934784022.py:19: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(x='CREDIT_CATEGORY', data=df, order=category_order, ax=axes[0], palette='RdYlGn')
Cross-tabulation: CREDIT_CATEGORY vs DEFAULT
DEFAULT              0      1
CREDIT_CATEGORY              
Excellent        1.000  0.000
Fair             0.677  0.323
Good             0.814  0.186
Poor             0.333  0.667
All              0.716  0.284

4.3. Correlation Analysis with Target Variables¶

Top Correlated Features¶

In [152]:
# Top correlated features with CREDIT_SCORE
credit_corr = df[numeric_cols].corr()['CREDIT_SCORE'].drop('CREDIT_SCORE').sort_values(key=abs, ascending=False)

print("Top 15 features correlated with CREDIT_SCORE:")
print(credit_corr.head(15))
print("\n" + "="*50)

# Top correlated features with DEFAULT
default_corr = df[numeric_cols].corr()['DEFAULT'].drop('DEFAULT').sort_values(key=abs, ascending=False)

print("\nTop 15 features correlated with DEFAULT:")
print(default_corr.head(15))
Top 15 features correlated with CREDIT_SCORE:
R_DEBT_INCOME          -0.781501
R_EXPENDITURE_DEBT      0.617799
R_DEBT_SAVINGS         -0.606353
R_TAX_DEBT              0.597723
R_UTILITIES_DEBT        0.569899
R_ENTERTAINMENT_DEBT    0.560518
R_CLOTHING_DEBT         0.527886
R_GROCERIES_DEBT        0.515530
R_HEALTH_DEBT           0.500986
R_TRAVEL_DEBT           0.416605
DEBT                   -0.412932
R_HOUSING_DEBT          0.350996
R_UTILITIES_SAVINGS    -0.344134
DEFAULT                -0.326766
R_HEALTH_SAVINGS       -0.268628
Name: CREDIT_SCORE, dtype: float64

==================================================

Top 15 features correlated with DEFAULT:
CREDIT_SCORE           -0.326766
R_DEBT_INCOME           0.253076
R_TAX_DEBT             -0.246836
R_EXPENDITURE_DEBT     -0.208215
R_ENTERTAINMENT_DEBT   -0.199903
R_CLOTHING_DEBT        -0.189810
R_DEBT_SAVINGS          0.187416
R_HEALTH_DEBT          -0.177106
R_UTILITIES_DEBT       -0.174043
R_GROCERIES_DEBT       -0.157802
DEBT                    0.125168
R_TRAVEL_DEBT          -0.124004
CAT_CREDIT_CARD         0.119993
R_GROCERIES_SAVINGS     0.103225
R_UTILITIES_SAVINGS     0.095051
Name: DEFAULT, dtype: float64
In [153]:
# Visualize top correlated features
fig, axes   = plt.subplots(1, 2, figsize=(16, 6))

# Top 10 for CREDIT_SCORE
top_credit  = credit_corr.head(10)
colors      = ['green' if x > 0 else 'red' for x in top_credit.values]
top_credit.plot(kind='barh', ax=axes[0], color=colors)
axes[0].set_title('Top 10 Features Correlated with CREDIT_SCORE')
axes[0].set_xlabel('Correlation')

# Top 10 for DEFAULT
top_default = default_corr.head(10)
colors      = ['green' if x > 0 else 'red' for x in top_default.values]
top_default.plot(kind='barh', ax=axes[1], color=colors)
axes[1].set_title('Top 10 Features Correlated with DEFAULT')
axes[1].set_xlabel('Correlation')

plt.tight_layout()
plt.show()
  • Top Features Correlated with CREDIT_SCORE: The strongest correlation is with features like R_EXPENDITURE_DEBT (positive) and R_DEBT_INCOME (negative). The highest correlated features are ratio variables that directly measure a customer's financial stability and leverage. High savings and low debt are consistently linked to a higher CREDIT_SCORE while a high ratio of debt to income should lower it.

  • Top Features Correlated with DEFAULT: The features correlated with DEFAULT are similar but with inverted signs, as expected (since high score means low default risk). R_DEBT_INCOME has a strong positive correlation, meaning high debt relative to income increases the likelihood of default. The strongest correlation seen is with R_EXPENDITURE_DEBT (positive) and T_HOUSING_6 (negative). The features that strongly predict a higher CREDIT_SCORE are the inverse predictors of DEFAULT.

Scatter plots for top 6 features vs CREDIT_SCORE¶

In [154]:
top_features    = credit_corr.head(6).index.tolist()

fig, axes       = plt.subplots(2, 3, figsize=(18, 10))
axes            = axes.flatten()

for i, feature in enumerate(top_features):
    axes[i].scatter(df[feature], df['CREDIT_SCORE'], alpha=0.5, c=df['DEFAULT'], cmap='coolwarm')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('CREDIT_SCORE')
    axes[i].set_title(f'{feature} vs CREDIT_SCORE (r={credit_corr[feature]:.3f})')

plt.tight_layout()
plt.show()

Box plots for top 6 features by DEFAULT status¶

In [155]:
top_default_features = default_corr.head(6).index.tolist()

fig, axes   = plt.subplots(2, 3, figsize=(18, 10))
axes        = axes.flatten()

for i, feature in enumerate(top_default_features):
    sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
    axes[i].set_title(f'{feature} by DEFAULT (r={default_corr[feature]:.3f})')
    axes[i].set_xticklabels(['No Default', 'Default'])

plt.tight_layout()
plt.show()
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])
/tmp/ipykernel_4647/2839064703.py:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='DEFAULT', y=feature, data=df, ax=axes[i], palette='Set2')
/tmp/ipykernel_4647/2839064703.py:9: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(['No Default', 'Default'])

Six box plots were generated to visualize the relationship between the top-correlated numerical features and the DEFAULT status.

  • R_DEBT_INCOME:
    The defaulter group (1) has a clearly higher median ratio than the non-defaulter group (0), confirming the 0.251 correlation. This suggests that customers with high debt relative to their income are more likely to default.

  • R_DEBT_SAVINGS:
    Similar to R_DEBT_INCOME, the median R_DEBT_SAVINGS is significantly higher for defaulters, highlighting that having much more debt than savings is a strong indicator of risk.

  • R_EXPENDITURE_DEBT:
    This ratio shows a slightly lower median for defaulters, though the difference is less pronounced than for the income and savings ratios.

Analysis of Binary Categorical Features vs. DEFAULT¶

In [156]:
# Analysis of binary categorical features vs DEFAULT
binary_cats = ['CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS']

fig, axes   = plt.subplots(2, 3, figsize=(18, 10))
axes        = axes.flatten()

for i, cat in enumerate(binary_cats):
    # Default rate by category
    default_rate        = df.groupby(cat)['DEFAULT'].mean()
    default_rate.plot(kind='bar', ax=axes[i], color=['steelblue', 'coral'])
    axes[i].set_title(f'Default Rate by {cat}')
    axes[i].set_ylabel('Default Rate')
    axes[i].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)
    
    # Add values on bars
    for j, v in enumerate(default_rate.values):
        axes[i].text(j, v + 0.01, f'{v:.2%}', ha='center')

axes[5].set_visible(False)
plt.tight_layout()
plt.show()

# Summary statistics
print("Default Rate by Binary Categorical Features:")
for cat in binary_cats:
    rate = df.groupby(cat)['DEFAULT'].mean()
    print(f"\n{cat}:")
    print(f"  No (0): {rate.get(0, 0):.2%}")
    print(f"  Yes (1): {rate.get(1, 0):.2%}")
Default Rate by Binary Categorical Features:

CAT_DEBT:
  No (0): 21.43%
  Yes (1): 28.81%

CAT_CREDIT_CARD:
  No (0): 25.39%
  Yes (1): 38.14%

CAT_MORTGAGE:
  No (0): 27.57%
  Yes (1): 32.37%

CAT_SAVINGS_ACCOUNT:
  No (0): 42.86%
  Yes (1): 28.30%

CAT_DEPENDENTS:
  No (0): 27.18%
  Yes (1): 35.33%

The analysis included several binary categorical features (such as CAT_DEBT, CAT_MORTGAGE, etc.) to assess their impact on the default rate.

  • CAT_DEBT and CAT_CREDIT_CARD:
    Having debt (CAT_DEBT = 1) slightly increases the default rate compared to not having debt (CAT_DEBT = 0). Similarly, having a credit card (CAT_CREDIT_CARD = 1) corresponds to a higher default rate than not having one. This suggests that access to these financial products, while common, may increase exposure to financial risk.

  • CAT_SAVINGS_ACCOUNT:
    Customers who do not have a savings account (CAT_SAVINGS_ACCOUNT = 0) exhibit a higher default rate (42.86%) compared to those who do (28.30%). This is a crucial finding, indicating that the presence of a savings buffer acts as a protective factor against default.

  • CAT_DEPENDENTS:
    Surprisingly, having dependents (CAT_DEPENDENTS = 1) only slightly increases the default rate compared to those without dependents (35.35% vs 27.18%), suggesting that this factor alone is not a strong predictor of default risk.

4.4. Feature Correlation Analysis - Identifying Redundant Features¶

In [157]:
# Find highly correlated feature pairs (potential redundancy)
corr_matrix = df[numeric_cols].corr().abs()

# Get upper triangle
upper_tri   = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find pairs with correlation > 0.9
high_corr_pairs = []
for col in upper_tri.columns:
    for idx in upper_tri.index:
        if upper_tri.loc[idx, col] > 0.9:
            high_corr_pairs.append((idx, col, upper_tri.loc[idx, col]))

# Sort by correlation
high_corr_pairs.sort(key=lambda x: x[2], reverse=True)

print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (r > 0.9):")
print("="*70)
for feat1, feat2, corr_val in high_corr_pairs[:20]:  # Show top 20
    print(f"{feat1} <-> {feat2}: {corr_val:.3f}")
Found 32 highly correlated feature pairs (r > 0.9):
======================================================================
T_UTILITIES_12 <-> T_UTILITIES_6: 1.000
T_EDUCATION_12 <-> T_EDUCATION_6: 1.000
T_HOUSING_12 <-> T_HOUSING_6: 1.000
T_GAMBLING_12 <-> T_GAMBLING_6: 1.000
T_TAX_12 <-> T_TAX_6: 0.999
T_GROCERIES_12 <-> T_GROCERIES_6: 0.996
T_ENTERTAINMENT_12 <-> T_ENTERTAINMENT_6: 0.994
T_EXPENDITURE_12 <-> T_EXPENDITURE_6: 0.989
INCOME <-> T_EXPENDITURE_12: 0.977
T_TRAVEL_12 <-> T_TRAVEL_6: 0.975
T_GAMBLING_6 <-> CAT_GAMBLING: 0.970
T_GAMBLING_12 <-> CAT_GAMBLING: 0.969
T_GROCERIES_12 <-> T_UTILITIES_6: 0.965
T_GROCERIES_12 <-> T_UTILITIES_12: 0.964
INCOME <-> T_EXPENDITURE_6: 0.963
T_GROCERIES_6 <-> T_UTILITIES_6: 0.953
T_GROCERIES_6 <-> T_UTILITIES_12: 0.952
T_TAX_6 <-> T_UTILITIES_6: 0.927
T_TAX_12 <-> T_UTILITIES_6: 0.927
T_TAX_6 <-> T_UTILITIES_12: 0.926

4.5. Principal component analysis (PCA)¶

The original dataset contains a large number of numerical variables derived from financial indicators, spending categories, and ratio-based measures. Many of these variables are highly correlated, which can lead to multicollinearity issues and reduce model stability. To address this, Principal Component Analysis (PCA) was applied to reduce dimensionality, remove correlations between variables, and extract latent structures representative of financial behavior.

Before applying Principal Component Analysis (PCA), all numerical variables were robustly scaled in order to reduce the influence of extreme values and outliers, which are common in financial data. Unlike standard scaling based on the mean and standard deviation, robust scaling relies on the median and the interquartile range (IQR), making it more stable for skewed distributions.

Robust_scale.png

This scaling can handle extreme value - outliers.¶

In [158]:
from sklearn.preprocessing import RobustScaler

x       = df.drop(columns=['CREDIT_SCORE', 'DEFAULT', 'CREDIT_CATEGORY'])
y_class = df['DEFAULT']
y_reg   = df['CREDIT_SCORE']
y_seg   = df['CREDIT_CATEGORY']

scaler      = RobustScaler()
x_scaled    = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

Fit PCA on x-scaled¶

In [159]:
from sklearn.decomposition import PCA

pca = PCA()                   
pca.fit(x_scaled)
Out[159]:
PCA()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
n_components  None
copy  True
whiten  False
svd_solver  'auto'
tol  0.0
iterated_power  'auto'
n_oversamples  10
power_iteration_normalizer  'auto'
random_state  None
In [160]:
# Cumulative variance
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Choose threshold
threshold       = 0.90   # 90% explained variance

n_components    = np.argmax(cum_var >= threshold) + 1
plt.figure(figsize=(10,5))
plt.plot(cum_var, marker='o', label="Cumulative Explained Variance")

# Horizontal threshold
plt.axhline(threshold, color='red', linestyle='--', label=f"{int(threshold*100)}% threshold")

# Vertical line marking n_components
plt.axvline(n_components-1, color='green', linestyle='--')
plt.text(
    n_components-1, threshold + 0.02
    , f"n_components = {n_components}"
    , color     = 'green'
    , fontsize  = 12
)

plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Scree Plot with Threshold")
plt.grid(True)
plt.legend()
plt.show()

print(f"Selected number of components: {n_components}")
Selected number of components: 18

Based on PCA scree plot: Based on the eigenvalue plot and the cumulative explained variance, a threshold of 90% explained variance was established. This threshold is reached when 18 principal components are retained. Beyond this point, additional components provide only marginal gains in terms of explained variance.

fit pca with chosen component¶

In [161]:
pca   = PCA(n_components=n_components)
x_pca = pca.fit_transform(x_scaled)
In [162]:
pca.explained_variance_ratio_
Out[162]:
array([0.16907334, 0.14337906, 0.10834535, 0.08341869, 0.07777103,
       0.06228348, 0.04958545, 0.04526524, 0.0350692 , 0.03352146,
       0.01733876, 0.01395499, 0.01284695, 0.01241365, 0.01101743,
       0.01076421, 0.00934558, 0.0086326 ])
In [163]:
pca.explained_variance_ratio_.sum()
Out[163]:
np.float64(0.904026468600927)

Shows which original variables contribute the most to each PC.¶

In [164]:
loadings = pd.DataFrame(
    pca.components_.T
    , columns   = [f'PC{i+1}' for i in range(n_components)]
    , index     = x.columns
)
In [165]:
pc_tables = {}

for i in range(n_components):
    pc_name     = f'PC{i+1}'
    top_vals    = loadings[pc_name].abs().sort_values(ascending=False).head(10)
    pc_tables[pc_name] = pd.DataFrame({
        'Variable'          : top_vals.index
        , 'Loading Value'   : top_vals.values
    })

for pc, table in pc_tables.items():
    print(f"\n==== {pc} ====")
    display(table)
==== PC1 ====
Variable Loading Value
0 R_GAMBLING_SAVINGS 0.264770
1 INCOME 0.215508
2 T_EXPENDITURE_6 0.208653
3 T_CLOTHING_12 0.201097
4 T_EXPENDITURE_12 0.200726
5 T_TAX_12 0.200374
6 T_TAX_6 0.200075
7 R_EXPENDITURE_INCOME 0.199031
8 T_GROCERIES_6 0.196335
9 T_UTILITIES_12 0.195952
==== PC2 ====
Variable Loading Value
0 R_GAMBLING_SAVINGS 0.621232
1 R_EXPENDITURE_SAVINGS 0.167549
2 R_EDUCATION_SAVINGS 0.159937
3 SAVINGS 0.156874
4 R_GROCERIES_SAVINGS 0.156814
5 R_SAVINGS_INCOME 0.155187
6 R_GAMBLING_INCOME 0.154184
7 R_ENTERTAINMENT_SAVINGS 0.150321
8 R_HEALTH_INCOME 0.146570
9 R_HEALTH_DEBT 0.142328
==== PC3 ====
Variable Loading Value
0 R_TAX 0.962009
1 T_FINES_12 0.096223
2 R_FINES_SAVINGS 0.088900
3 T_FINES_6 0.081871
4 R_FINES_INCOME 0.080637
5 R_FINES_DEBT 0.078642
6 T_TRAVEL_12 0.055328
7 T_TRAVEL_6 0.046539
8 R_HEALTH_DEBT 0.044465
9 R_EXPENDITURE_INCOME 0.038941
==== PC4 ====
Variable Loading Value
0 R_GAMBLING_SAVINGS 0.523514
1 R_EDUCATION_SAVINGS 0.209941
2 R_GAMBLING_DEBT 0.206583
3 R_HEALTH_DEBT 0.200184
4 R_GAMBLING_INCOME 0.192789
5 R_UTILITIES_DEBT 0.171324
6 R_HEALTH_INCOME 0.167269
7 R_GROCERIES_SAVINGS 0.163180
8 R_GROCERIES_DEBT 0.162360
9 R_TRAVEL_SAVINGS 0.154809
==== PC5 ====
Variable Loading Value
0 T_FINES_12 0.458215
1 R_FINES_INCOME 0.438916
2 T_FINES_6 0.412101
3 R_FINES_SAVINGS 0.395987
4 R_FINES_DEBT 0.395031
5 R_GAMBLING_SAVINGS 0.176785
6 R_TAX 0.173157
7 R_UTILITIES_INCOME 0.072223
8 R_TRAVEL_INCOME 0.064162
9 R_EXPENDITURE 0.047921
==== PC6 ====
Variable Loading Value
0 R_TRAVEL_INCOME 0.321089
1 R_EXPENDITURE 0.302117
2 R_UTILITIES_INCOME 0.277703
3 T_TRAVEL_6 0.276244
4 R_GROCERIES 0.265586
5 T_TRAVEL_12 0.258799
6 R_TRAVEL 0.242218
7 R_TRAVEL_DEBT 0.200495
8 R_GROCERIES_INCOME 0.194468
9 R_CLOTHING_DEBT 0.163115
==== PC7 ====
Variable Loading Value
0 R_TAX_DEBT 0.370426
1 R_ENTERTAINMENT_INCOME 0.260088
2 R_CLOTHING_DEBT 0.246320
3 R_ENTERTAINMENT_DEBT 0.225659
4 T_ENTERTAINMENT_6 0.221338
5 T_ENTERTAINMENT_12 0.214602
6 R_TAX_SAVINGS 0.191690
7 R_ENTERTAINMENT_SAVINGS 0.186203
8 R_EXPENDITURE 0.184222
9 R_DEBT_INCOME 0.180375
==== PC8 ====
Variable Loading Value
0 R_EXPENDITURE 0.337961
1 R_GROCERIES 0.321378
2 T_TRAVEL_12 0.307650
3 R_HEALTH_SAVINGS 0.288540
4 R_TRAVEL 0.252120
5 R_HEALTH 0.246770
6 R_TRAVEL_INCOME 0.226208
7 R_CLOTHING_INCOME 0.220396
8 T_TRAVEL_6 0.213178
9 R_CLOTHING 0.212854
==== PC9 ====
Variable Loading Value
0 R_EDUCATION_SAVINGS 0.487227
1 R_EDUCATION_DEBT 0.420181
2 R_EDUCATION_INCOME 0.345373
3 T_EDUCATION_12 0.265766
4 T_EDUCATION_6 0.265618
5 R_HEALTH_SAVINGS 0.192148
6 R_HOUSING_SAVINGS 0.159801
7 R_CLOTHING_INCOME 0.155300
8 R_UTILITIES_SAVINGS 0.151580
9 R_ENTERTAINMENT_INCOME 0.136639
==== PC10 ====
Variable Loading Value
0 R_HOUSING_DEBT 0.463402
1 R_HOUSING_SAVINGS 0.408819
2 R_HOUSING_INCOME 0.386008
3 T_HOUSING_12 0.346977
4 T_HOUSING_6 0.346896
5 R_UTILITIES_INCOME 0.208940
6 R_TRAVEL_INCOME 0.156058
7 CAT_MORTGAGE 0.147646
8 R_GROCERIES_INCOME 0.121152
9 R_HEALTH_INCOME 0.111284
==== PC11 ====
Variable Loading Value
0 R_HEALTH_SAVINGS 0.453949
1 R_ENTERTAINMENT_INCOME 0.268149
2 R_UTILITIES_SAVINGS 0.259457
3 R_HEALTH_INCOME 0.218178
4 T_TRAVEL_12 0.214122
5 R_TAX_INCOME 0.200787
6 R_EXPENDITURE 0.184827
7 R_HEALTH_DEBT 0.175331
8 T_TRAVEL_6 0.171943
9 R_TRAVEL_DEBT 0.155098
==== PC12 ====
Variable Loading Value
0 R_EXPENDITURE_INCOME 0.385980
1 R_CLOTHING_INCOME 0.305621
2 R_GROCERIES_SAVINGS 0.296337
3 R_EDUCATION 0.281494
4 R_GROCERIES_INCOME 0.251981
5 R_FINES_DEBT 0.212099
6 R_ENTERTAINMENT_SAVINGS 0.192833
7 T_GROCERIES_6 0.164359
8 T_GROCERIES_12 0.162690
9 R_EDUCATION_INCOME 0.161863
==== PC13 ====
Variable Loading Value
0 R_GAMBLING 0.750523
1 R_EDUCATION 0.514478
2 T_FINES_6 0.146086
3 R_FINES_DEBT 0.130870
4 R_ENTERTAINMENT 0.127652
5 R_EXPENDITURE_INCOME 0.110343
6 R_FINES 0.109316
7 R_ENTERTAINMENT_INCOME 0.085537
8 R_HEALTH 0.081543
9 R_TAX_DEBT 0.077120
==== PC14 ====
Variable Loading Value
0 R_EDUCATION 0.506591
1 R_GAMBLING 0.390975
2 R_CLOTHING_INCOME 0.342160
3 R_EXPENDITURE_INCOME 0.271218
4 R_GROCERIES_SAVINGS 0.255939
5 R_FINES 0.206527
6 R_TAX_DEBT 0.171493
7 R_FINES_DEBT 0.141373
8 R_CLOTHING_DEBT 0.138850
9 R_SAVINGS_INCOME 0.122553
==== PC15 ====
Variable Loading Value
0 R_ENTERTAINMENT 0.430933
1 R_EDUCATION 0.405239
2 R_FINES 0.317757
3 R_EXPENDITURE_INCOME 0.317470
4 R_ENTERTAINMENT_INCOME 0.252775
5 R_TAX_DEBT 0.230853
6 R_GROCERIES 0.197504
7 T_ENTERTAINMENT_6 0.180592
8 R_GROCERIES_SAVINGS 0.176899
9 R_HEALTH_INCOME 0.168013
==== PC16 ====
Variable Loading Value
0 R_FINES 0.887295
1 R_EDUCATION 0.343998
2 R_ENTERTAINMENT 0.144798
3 R_FINES_SAVINGS 0.114536
4 R_ENTERTAINMENT_INCOME 0.086826
5 R_FINES_DEBT 0.066263
6 R_CLOTHING_INCOME 0.064834
7 R_EDUCATION_SAVINGS 0.063090
8 T_ENTERTAINMENT_6 0.062206
9 T_FINES_12 0.062132
==== PC17 ====
Variable Loading Value
0 R_GAMBLING 0.418978
1 R_EXPENDITURE_INCOME 0.374325
2 R_ENTERTAINMENT 0.361073
3 R_CLOTHING_INCOME 0.324506
4 R_TRAVEL 0.273148
5 R_EDUCATION 0.261715
6 R_EXPENDITURE 0.195481
7 R_HEALTH_SAVINGS 0.167330
8 R_GROCERIES 0.157019
9 R_HEALTH 0.155118
==== PC18 ====
Variable Loading Value
0 R_GROCERIES 0.616102
1 R_TRAVEL 0.459643
2 R_HEALTH 0.377438
3 R_ENTERTAINMENT 0.185389
4 R_CLOTHING 0.185278
5 R_EXPENDITURE_INCOME 0.153569
6 R_UTILITIES_INCOME 0.146982
7 R_GAMBLING 0.127939
8 R_FINES_DEBT 0.114547
9 R_TAX_INCOME 0.113939
PC1 — Total Spending Intensity¶
  • Dominant features: TEXPENDITURE*, T_CLOTHING_12, T_UTILITIES_12, INCOME, R_GAMBLING_SAVINGS
  • Meaning: High overall spending volume + weak savings efficiency, especially linked to gambling-related ratios.
PC2 — Savings Efficiency Across Categories¶
  • Dominant features: R_GAMBLING_SAVINGS, R_EXPENDITURE_SAVINGS, R_EDUCATION_SAVINGS, SAVINGS, R_SAVINGS_INCOME
  • Meaning: Measures how efficiently customers save relative to their spending.
PC3 — Tax Burden¶
  • Dominant feature: R_TAX (very high loading)
  • Meaning: Indicates high tax pressure relative to income or savings.
PC4 — Gambling & Education/Health Financial Stress¶
  • Dominant features: R_GAMBLING_SAVINGS, R_GAMBLING_DEBT, R_EDUCATION_SAVINGS, R_HEALTH_DEBT
  • Meaning: Captures financial stress due to gambling, education costs, and health expenses.
PC5 — Fines & Penalty Behavior¶
  • Dominant features: T_FINES_12, T_FINES_6, RFINES*
  • Meaning: Represents customers with frequent penalties or fines.
PC6 — Travel & Utilities Lifestyle Pattern¶
  • Dominant features: R_TRAVEL_INCOME, R_EXPENDITURE, R_UTILITIES_INCOME, TTRAVEL*
  • Meaning: Measures travel-oriented and utility-heavy spending patterns.
PC7 — Entertainment Spending Supported by Debt¶
  • Dominant features: R_ENTERTAINMENT_INCOME, R_CLOTHING_DEBT, TENTERTAINMENT*, R_TAX_DEBT
  • Meaning: Discretionary entertainment spending that is partly financed through debt.
PC8 — General Lifestyle Consumption Mix¶
  • Dominant features: R_GROCERIES, R_EXPENDITURE, R_TRAVEL, R_HEALTH_SAVINGS
  • Meaning: Balanced lifestyle component covering food, travel, and health spending.
PC9 — Education Financial Burden¶
  • Dominant features: R_EDUCATION_SAVINGS, R_EDUCATION_DEBT, R_EDUCATION_INCOME, TEDUCATION*
  • Meaning: Education-related long-term financial obligations.
PC10 — Housing & Mortgage Burden¶
  • Dominant features: R_HOUSING_DEBT, R_HOUSING_SAVINGS, R_HOUSING_INCOME, THOUSING*, CAT_MORTGAGE
  • Meaning: Measures housing debt pressure and mortgage commitments.
PC11 — Health & Utilities Pressure¶
  • Dominant features: R_HEALTH_SAVINGS, R_UTILITIES_SAVINGS, R_HEALTH_INCOME
  • Meaning: Intensity of health and utilities expenses relative to savings/income.
PC12 — Core Grocery & Clothing Expenditure Patterns¶
  • Dominant features: R_EXPENDITURE_INCOME, R_CLOTHING_INCOME, R_GROCERIES_SAVINGS, R_EDUCATION
  • Meaning: Stable essential spending behaviors (food, clothing, education).
PC13 — Gambling + Education Mixed Pattern¶
  • Dominant features: R_GAMBLING, R_EDUCATION
  • Meaning: Customers who combine gambling activity with education spending.
PC14 — Clothing, Education & Tax Mix¶
  • Dominant features: R_EDUCATION, R_GAMBLING, R_CLOTHING_INCOME, R_EXPENDITURE_INCOME
  • Meaning: A mixture of education cost, gambling, clothing expense ratios, and general expenditure ratios.
PC15 — Entertainment + Education + Fines Behavior¶
  • Dominant features: R_ENTERTAINMENT, R_EDUCATION, R_FINES, R_EXPENDITURE_INCOME
  • Meaning: Moderate discretionary spending with occasional financial penalties.
PC16 — Fines-Only Burden¶
  • Dominant features: R_FINES, R_FINES_SAVINGS, R_FINES_DEBT
  • Meaning: A second fines-focused dimension capturing penalty structure.
PC17 — General Behavioral Risk Pattern¶
  • Dominant features: R_GAMBLING, R_ENTERTAINMENT, R_CLOTHING_INCOME, R_TRAVEL
  • Meaning: A combined behavioral risk factor related to gambling and lifestyle consumption.
PC18 — Food, Travel & Health Cost Structure¶
  • Dominant features: R_GROCERIES, R_TRAVEL, R_HEALTH
  • Meaning: Day-to-day living cost structure (food, mobility, healthcare).

2D PCA Scatter Plot¶

In [166]:
# PCA 2D Scatter Plot by Credit Category
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y_seg.map({'Poor':0,'Fair':1,'Good':2,'Excellent':3}), cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter Plot by Credit Category")
plt.colorbar()
plt.show()

When points are colored by credit category shown in plot above, no clear separation between categories is observed. All categories largely overlap in the PCA space, indicating that the main components capture general financial behavior rather than predefined credit classes.

In [167]:
# PCA 2D Scatter by DEFAULT
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_class, cmap='coolwarm', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by DEFAULT")
plt.colorbar()
plt.show()

When coloring by DEFAULT status in plot PCA 2D PCA Scatter by DEFAULT, defaulters and non-defaulters are also highly mixed. There is no clear boundary separating the two groups in the PC1–PC2 plane. This confirms that PCA, as an unsupervised method, does not directly discriminate default risk.

In [168]:
# PCA 2D Scatter by CREDIT_SCORE
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_reg, cmap='plasma', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by CREDIT_SCORE")
plt.colorbar()
plt.show()

when points are colored by CREDIT SCORE in plot above, a gradual color transition can be observed rather than sharp clusters. Higher and lower credit scores are spread across the space, with slight local trends but no clear linear separation.

Overally, The PCA plot (PC1 vs PC2) shows that the four credit category, risk default and credit score overlap heavily , no clear clusters nor regression line. This means the main direction of variation in data mostly by spending and saving behavior not align with credit categories.

Since PCA is unsupervised and focuses on overall variance, its normal that the credit froups are not visually seperated in the first 2 components.

With PCA data, we can also work toward a new task : Customer Financial Behavior Segmentation.

PCA-based Dataset¶

In [169]:
df_pca              = pd.DataFrame(x_pca, columns=[f'PC{i+1}' for i in range(n_components)])
df_pca["DEFAULT"]   = y_class
df_pca["CREDIT_SCORE"] = y_reg
df_pca["CATEGORY"]     = y_seg

df_pca.head()
Out[169]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC12 PC13 PC14 PC15 PC16 PC17 PC18 DEFAULT CREDIT_SCORE CATEGORY
0 -1.263568 1.675703 1.616015 -1.777544 0.900540 0.361975 -2.571286 -0.326115 0.336299 0.484445 ... -1.161316 -0.392118 -0.575822 0.026258 1.003179 -0.838064 0.592423 1 444.0 Poor
1 -1.113713 0.097301 -0.083219 -1.488284 -0.278663 -1.442984 1.640455 -2.546311 -2.071634 1.182105 ... 0.090272 0.011836 -1.387963 -1.021540 0.263893 -0.475882 0.268768 0 625.0 Good
2 0.244502 6.755585 -0.368538 1.002296 4.414212 1.369341 -3.355328 0.337632 -1.563512 0.385298 ... 0.491544 -0.693706 -0.040053 0.396325 0.708798 0.652537 1.179665 1 469.0 Poor
3 2.228464 5.653206 2.395674 0.698185 -1.647916 0.834930 0.376260 -0.190960 0.288253 1.457348 ... -0.885084 0.668575 -0.342176 0.983585 -1.665101 -0.285464 0.304583 0 559.0 Fair
4 1.300493 -2.019472 -2.782794 0.626834 -0.635844 -2.531894 -2.460998 -1.658941 0.163559 0.804533 ... 0.938645 0.124539 0.018004 0.481896 -0.458401 -0.139912 -0.680096 0 473.0 Poor

5 rows × 21 columns

We can use PCA_data to cluster customer financial behaviour segmentation¶

In [170]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Use PC1–PC10 for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(n_components)]]

sil_scores = []
K_range = range(2, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_cluster)
    sil = silhouette_score(X_cluster, labels)
    sil_scores.append(sil)
    print(f"K={k}  Silhouette Score = {sil:.4f}")

plt.figure(figsize=(6,4))
plt.plot(K_range, sil_scores, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Analysis")
plt.grid(True)
plt.show()
K=2  Silhouette Score = 0.1493
K=3  Silhouette Score = 0.1649
K=4  Silhouette Score = 0.1651
K=5  Silhouette Score = 0.1705
K=6  Silhouette Score = 0.1833
K=7  Silhouette Score = 0.1672
K=8  Silhouette Score = 0.1787
K=9  Silhouette Score = 0.1664
In [171]:
from sklearn.cluster import KMeans
# Use PC1–PC10 for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(n_components)]]

kmeans    = KMeans(n_clusters=6, random_state=42)
clusters  = kmeans.fit_predict(X_cluster)

# add cluster labels
df_pca['Cluster'] = clusters
In [172]:
df_pca['Cluster'].unique()
Out[172]:
array([2, 5, 3, 0, 4, 1], dtype=int32)
In [173]:
df_pca.head()
Out[173]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC13 PC14 PC15 PC16 PC17 PC18 DEFAULT CREDIT_SCORE CATEGORY Cluster
0 -1.263568 1.675703 1.616015 -1.777544 0.900540 0.361975 -2.571286 -0.326115 0.336299 0.484445 ... -0.392118 -0.575822 0.026258 1.003179 -0.838064 0.592423 1 444.0 Poor 2
1 -1.113713 0.097301 -0.083219 -1.488284 -0.278663 -1.442984 1.640455 -2.546311 -2.071634 1.182105 ... 0.011836 -1.387963 -1.021540 0.263893 -0.475882 0.268768 0 625.0 Good 2
2 0.244502 6.755585 -0.368538 1.002296 4.414212 1.369341 -3.355328 0.337632 -1.563512 0.385298 ... -0.693706 -0.040053 0.396325 0.708798 0.652537 1.179665 1 469.0 Poor 5
3 2.228464 5.653206 2.395674 0.698185 -1.647916 0.834930 0.376260 -0.190960 0.288253 1.457348 ... 0.668575 -0.342176 0.983585 -1.665101 -0.285464 0.304583 0 559.0 Fair 3
4 1.300493 -2.019472 -2.782794 0.626834 -0.635844 -2.531894 -2.460998 -1.658941 0.163559 0.804533 ... 0.124539 0.018004 0.481896 -0.458401 -0.139912 -0.680096 0 473.0 Poor 0

5 rows × 22 columns

Compute PCA Component means per cluster¶

In [174]:
cluster_profile = df_pca.groupby('Cluster')[ [f'PC{i+1}' for i in range(n_components)] ].mean().round(2)
cluster_profile
Out[174]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
Cluster
0 1.84 -1.58 -0.18 -1.18 -0.17 -0.51 0.07 -0.03 0.10 0.03 -0.09 0.06 0.07 0.04 -0.07 -0.02 -0.04 -0.02
1 -1.29 5.18 0.58 1.40 -1.56 -0.75 -1.47 0.48 0.42 -0.34 0.52 -0.31 -0.10 0.04 -0.09 0.11 0.21 -0.03
2 -3.44 1.33 0.09 -2.11 -0.59 -0.08 -0.07 0.01 -0.23 0.09 -0.20 0.12 -0.01 0.14 -0.10 -0.01 -0.04 -0.02
3 3.19 2.48 0.05 1.77 -1.25 0.49 0.40 -0.32 -0.12 0.01 -0.30 -0.03 0.03 0.03 0.02 -0.06 -0.03 0.02
4 -2.26 -2.51 0.63 1.94 0.06 0.51 0.11 0.17 -0.03 -0.08 0.39 -0.10 -0.07 -0.19 0.20 0.03 0.07 0.05
5 0.56 2.51 -1.90 0.21 6.44 0.64 -0.30 -0.04 0.11 0.12 0.07 -0.01 -0.10 -0.10 -0.03 0.10 0.01 -0.01

Cluster Interpretations Based on PCA Component Means¶

Cluster 0 — Financially Stable, Efficient Savers

  • PC2 = –1.89 → very strong savings efficiency
  • PC1 ≈ 0 → normal spending
  • PC4 ≈ 0 → low gambling/education/health stress
  • PC6 = –0.09 → slightly lower travel/utilities lifestyle
  • PC11 = 0.08 → slight health/utilities cost
    → This cluster represents low-risk, financially disciplined customers.

Cluster 1 — Moderate Lifestyle Spend + Weak Savings

  • PC2 = +2.48 → poor savings efficiency
  • PC8 = +0.55 → higher lifestyle consumption (groceries/travel/health)
  • PC11 = +0.33 → mild health/utilities pressure
  • PC7 = –0.81 → low entertainment-with-debt behavior
  • PC4 = –1.24 → low gambling/education stress
    → These customers spend moderately but save poorly.

Cluster 2 — High Lifestyle Cost + Behavioral Risk

  • PC6 = +1.91 → very high travel/utilities lifestyle spending
  • PC7 = +1.09 → entertainment spending with debt
  • PC9 = +1.03 → strong education financial burden
  • PC12 = +1.39 → strong essential spending structure
  • PC14 = +0.83 → clothing/education/tax mix
  • PC11 = –1.36 → healthcare/utilities imbalance
    → This is the highest-risk cluster with costly lifestyle and weak stability.

Cluster 3 — High Spending + Stress from Gambling/Education/Health

  • PC1 = +3.11 → high overall spending
  • PC2 = +2.78 → poor savings efficiency
  • PC4 = +1.78 → strong gambling/education/health stress
  • PC6 = +0.59 → elevated travel/utilities lifestyle
  • PC7 = +0.37 → entertainment spending with debt
    → High spenders under financial stress with weak savings.

Clustering Visualization¶

Scatter Plot of PC1 vs PC2 (Individual Customers)¶
In [175]:
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x       = 'PC1'
    , y     = 'PC2'
    , hue   = 'Cluster'
    , data  = df_pca
    , palette   = 'viridis'
    , s     = 70 
    , alpha = 0.7
    , legend= 'full'
)
plt.title("Scatter Plot of PC1 vs PC2 (After Clustering)")
Out[175]:
Text(0.5, 1.0, 'Scatter Plot of PC1 vs PC2 (After Clustering)')
Cluster Profile Bar Chart (Centroid Interpretation)¶
In [176]:
# Select the first n Principal Components
pc_cols                 = [f'PC{i+1}' for i in range(n_components)]
cluster_profile_plot    = cluster_profile[pc_cols]

# Prepare data for plotting
cluster_profile_melted = cluster_profile_plot.reset_index().melt(
    id_vars     = 'Cluster'
    , var_name  = 'Principal Component'
    , value_name= 'Mean Score'
)

plt.figure(figsize=(12, 6))
sns.barplot(
    x       = 'Principal Component'
    , y     = 'Mean Score'
    , hue   ='Cluster'
    , data  = cluster_profile_melted
    , palette= 'viridis'
)

plt.title('Cluster Profile: Mean Principal Component Scores (Centroids)')
plt.ylabel('Mean PC Score')
plt.xlabel('Principal Component')
plt.axhline(0, color='black', linestyle='--', linewidth=1) # Reference line at 0
plt.legend(title='Cluster ID')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.savefig('cluster_profile_bar_chart.png')
plt.show()
In [177]:
# Feature engineering 
# --- 1. Drop Non-Feature Columns ---
if 'CUST_ID' in df.columns:
    df = df.drop(columns=['CUST_ID'])
# 2. Drop Non-feature  columns
if 'Cluster_ID' in df.columns:
    # One-hot encode the cluster ID columns 
    cluster_dummies = pd.get_dummies(df['Cluster_ID'], prefix='CLUSTER', drop_first=True)
    df              = pd.concat([df.drop('Cluster_ID', axis=1), cluster_dummies], axis=1)
    print("Cluster IDs successfully integrated and one-hot encoded. ")
else:
    print("Warning: 'Cluster_ID' not found. Skipping cluster integration.")
Warning: 'Cluster_ID' not found. Skipping cluster integration.

5. Advanced Feature Engineering¶

Analysis shows:

  • Top feature (R_DEBT_INCOME) explains only 61% variance
  • Remaining features provide weak marginal gains
  • Missing interaction effects and volatility measures

Strategy: Create higher-order features to capture:

  1. Interaction effects between key financial metrics
  2. Spending volatility across categories
  3. Polynomial relationships for non-linear patterns
  4. Financial stability indicators

5.1. Canonical Correlation Analysis¶

While PCA focuses on variance maximization within a single set of variables, it does not explicitly measure relationships between different groups of variables. To analyze the dependency structure between financial indicators and spending behavior variables, we apply Canonical Correlation Analysis (CCA). CCA finds linear relationships between two sets of variables.

Here we analyze:

  • Group 1: Financial status variables

  • Group 2: Spending behavior variables

Canonical Correlation Analysis (CCA)¶

CCA seeks linear combinations of two variable blocks that are maximally correlated.

  • Data Matrices

    Let

    $$ X \in \mathbb{R}^{n \times p} $$

    denote the matrix of financial variables (income, savings, debt, financial ratios), and

    $$ Y \in \mathbb{R}^{n \times q} $$

    denote the matrix of spending-related variables (housing, gambling, leisure, health, travel, etc.).
    Both matrices are centered and standardized.

  • Canonical Variates

    CCA aims to find vectors $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$ such that the linear projections

    $$ u = Xa, \qquad v = Yb $$

    are maximally correlated.

  • Optimization Problem

    The canonical correlation is obtained by solving:

    $$ \max_{a,b} \; \mathrm{corr}(Xa, Yb) = \frac{a^\top \Sigma_{XY} b} {\sqrt{a^\top \Sigma_{XX} a \; b^\top \Sigma_{YY} b}}, $$

    subject to the constraints:

    $$ a^\top \Sigma_{XX} a = 1, \qquad b^\top \Sigma_{YY} b = 1, $$

    where $\Sigma_{XX}$ and $\Sigma_{YY}$ are within-set covariance matrices and
    $\Sigma_{XY}$ is the cross-covariance matrix.

  • Generalized Eigenvalue Problems

    The optimization leads to the following eigenvalue equations:

    $$ \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} a = \rho^2 a, $$

    $$ \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} b = \rho^2 b, $$

    where $\rho$ is the canonical correlation coefficient.

In [178]:
# Group 1 — Financial condition
X_financial     = df[["INCOME", "SAVINGS", "DEBT", "R_SAVINGS_INCOME", "R_DEBT_INCOME"]].copy()

# Group 2 — Behavioral spending variables (use key 12-month expenditures)
spending_cols   = [
    "T_GROCERIES_12"
    , "T_CLOTHING_12"
    , "T_EDUCATION_12"
    , "T_HEALTH_12"
    , "T_TRAVEL_12"
    , "T_ENTERTAINMENT_12"
    , "T_GAMBLING_12"
    , "T_HOUSING_12"
]
X_spending  = df[spending_cols].copy()

# Scale both groups
sc1     = StandardScaler()
sc2     = StandardScaler()

X1      = sc1.fit_transform(X_financial)
X2      = sc2.fit_transform(X_spending)

# Fit CCA

cca         = CCA(n_components=2)
X1_c, X2_c  = cca.fit_transform(X1, X2)

rho1 = np.corrcoef(X1_c[:, 0], X2_c[:, 0])[0, 1]
rho2 = np.corrcoef(X1_c[:, 1], X2_c[:, 1])[0, 1]
print("Canonical correlations:", rho1, rho2)

# Visualization
plt.figure(figsize=(8,6))
plt.scatter(X1_c[:,0], X2_c[:,0], alpha=0.6, c=df["DEFAULT"], cmap="coolwarm")
plt.xlabel("Canonical Variable 1 — Financial")
plt.ylabel("Canonical Variable 1 — Spending Behavior")
plt.title("CCA: Relationship Between Financial Status and Spending Behavior")
plt.colorbar(label="DEFAULT")
plt.grid(True)
plt.show()
Canonical correlations: 0.9914443873851756 0.7537951088810095

Canonical Correlation Analysis (CCA) shows an extremely strong association between financial indicators (income, savings, debt ratios) and spending behavior. The first canonical correlation is 0.991, indicating almost perfect alignment between the two multivariate structures. A second independent dimension $(\rho = 0.754)$ reveals a secondary but still strong financial–behavior pattern. Overall, financial status and spending patterns are highly coupled in this dataset, and even defaulting customers follow the same structure, only shifted toward the lower end of financial strength.

5.2. Factorial Correspondance Analysis¶

Principal Component Analysis (PCA) is designed for continuous numerical variables and is therefore not suitable for studying relationships between categorical variables. To analyze associations between categorical attributes and default behavior, we apply Factorial Correspondence Analysis (FCA), also known as Correspondence Analysis. FCA is used to analyze relationships between two categorical variables. Here we study:

  • CREDIT_CATEGORY vs DEFAULT
  • CAT_GAMBLING vs DEFAULT

FCA is based on the analysis of a contingency table constructed from two categorical variables.
In this study, it is applied to cross-tabulations such as CREDIT_CATEGORY vs DEFAULT and
CAT_GAMBLING vs DEFAULT to visualize and quantify their dependency structure.

  • Contingency Table

    Let $N = (n_{ij})$ denote the contingency table with $I$ row categories and $J$ column categories, and let

    $$ n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}. $$

    The matrix of relative frequencies is defined as:

    $$ P = \frac{1}{n} N. $$

  • Row and Column Masses

    The row and column marginal distributions (masses) are given by:

    $$ r_i = \sum_{j=1}^{J} p_{ij}, \qquad c_j = \sum_{i=1}^{I} p_{ij}. $$

    Let $r = (r_1, \dots, r_I)^\top$ and $c = (c_1, \dots, c_J)^\top$, and define the diagonal matrices:

    $$ D_r = \mathrm{diag}(r_1, \dots, r_I), \qquad D_c = \mathrm{diag}(c_1, \dots, c_J). $$

  • Independence Model and Standardized Residuals

    Under the hypothesis of independence between the two categorical variables, the expected frequency matrix is:

    $$ r c^\top. $$

    FCA analyzes deviations from this independence model using the standardized residual matrix:

    $$ S = D_r^{-1/2} (P - r c^\top) D_c^{-1/2}. $$

  • Singular Value Decomposition

    A Singular Value Decomposition (SVD) of $S$ is performed:

    $$ S = U \Delta V^\top, $$

    where $\Delta = \mathrm{diag}(\delta_1, \delta_2, \dots)$ contains the singular values.
    The associated eigenvalues

    $$ \lambda_k = \delta_k^2 $$

    represent the inertia explained by each factorial axis.

  • Principal Coordinates

    The principal coordinates of the row and column categories are respectively given by:

    $$ F = D_r^{-1/2} U \Delta, \qquad G = D_c^{-1/2} V \Delta. $$


In [179]:
# 1. CREDIT_CATEGORY vs DEFAULT
ct_credit = pd.crosstab(df["CREDIT_CATEGORY"], df["DEFAULT"])
ca_credit = prince.CA(n_components=2, random_state=42)
ca_credit = ca_credit.fit(ct_credit)

rows_credit = ca_credit.row_coordinates(ct_credit)
cols_credit = ca_credit.column_coordinates(ct_credit)

print(rows_credit)
print(cols_credit)

plt.figure(figsize=(10, 2))

# Plot rows (categories)
plt.scatter(rows_credit[0], [0]*len(rows_credit), color="blue", label="CREDIT_CATEGORY")
for i, label in enumerate(rows_credit.index):
    plt.text(rows_credit.iloc[i, 0], 0.02, label, color="blue")

# Plot columns (DEFAULT)
plt.scatter(cols_credit[0], [0]*len(cols_credit), color="red", label="DEFAULT")
for i, label in enumerate(cols_credit.index):
    plt.text(cols_credit.iloc[i, 0], -0.02, label, color="red")

plt.axvline(0, color="black", linewidth=0.5)
plt.yticks([])
plt.xlabel("FCA Dimension 1")
plt.title("1D Correspondence Analysis: CREDIT_CATEGORY vs DEFAULT")
plt.legend()
plt.savefig(f"{figure_dir}/fca_analysis_credit_category.png")
plt.show()
                        0
CREDIT_CATEGORY          
Excellent       -0.629800
Fair             0.087361
Good            -0.218432
Poor             0.848604
                0
DEFAULT          
0       -0.189525
1        0.477818

The result we get

Category FCA Coordinate Interpretation
Excellent −0.6298 Strongly aligned with non-default
Good −0.2184 Slightly aligned with non-default
Fair +0.0873 Slightly leaning toward risk
Poor +0.8486 Strongly aligned with default

Which we can interprete that, FCA confirms a monotonic relationship: as credit category worsens (Excellent → Poor), the likelihood of default increases, with “Poor” customers strongly associated with DEFAULT = 1 and “Excellent” with DEFAULT = 0.

In [180]:
# 2. CAT_GAMBLING vs DEFAULT

ct_credit = pd.crosstab(df["CAT_GAMBLING"], df["DEFAULT"])
ca_credit = prince.CA(n_components=2, random_state=42)
ca_credit = ca_credit.fit(ct_credit)

rows_credit = ca_credit.row_coordinates(ct_credit)
cols_credit = ca_credit.column_coordinates(ct_credit)

print(rows_credit)
print(cols_credit)

plt.figure(figsize=(10, 2))

# Plot rows (categories)
plt.scatter(rows_credit[0], [0]*len(rows_credit), color="blue", label="CAT_GAMBLING")
for i, label in enumerate(rows_credit.index):
    plt.text(rows_credit.iloc[i, 0], 0.02, label, color="blue")

# Plot columns (DEFAULT)
plt.scatter(cols_credit[0], [0]*len(cols_credit), color="red", label="DEFAULT")
for i, label in enumerate(cols_credit.index):
    plt.text(cols_credit.iloc[i, 0], -0.02, label, color="red")

plt.axvline(0, color="black", linewidth=0.5)
plt.yticks([])
plt.xlabel("FCA Dimension 1")
plt.title("1D Correspondence Analysis: CAT_GAMBLING vs DEFAULT")
plt.legend()
plt.savefig(f"{figure_dir}/fca_analysis_gambling.png")
plt.show()
                     0
CAT_GAMBLING          
0            -0.036054
1             0.001071
2             0.084202
                0
DEFAULT          
0       -0.032591
1        0.082165

The result, we got CAT_GAMBLING

CAT_GAMBLING Coordinate
0 = No gambling −0.036
1 = Low gambling +0.001
2 = High gambling +0.084

DEFAULT | DEFAULT | Coordinate | | ------------------ | ---------- | | 0 = No default | −0.033 | | 1 = Default | +0.082 |

FCA reveals a clear positive relationship between gambling intensity and default risk: high-gambling customers lie closest to the default category, while non-gamblers align with the non-default category. Low-gambling customers occupy an intermediate position, showing only a weak association.

Final Data Preparation and Feature Integration By Using the original dataset¶

In [181]:
# Define Features (X) and Targets (y)
y_classification = df['DEFAULT']
y_regression     = df['CREDIT_SCORE']

#Features (X) 
X      = df.drop(columns=['DEFAULT', 'CREDIT_SCORE', 'CREDIT_CATEGORY' ])

# Scale the Features
scaler      = StandardScaler()

# Fit and transform the features
X_scale     = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scale, columns=X.columns)

print(f"\nFinal feature set shape: {X_scaled_df.shape}")
print(f"Classification Target distribution:\n{y_classification.value_counts(normalize=True)}")
X_scaled_df
Final feature set shape: (1000, 84)
Classification Target distribution:
DEFAULT
0    0.716
1    0.284
Name: proportion, dtype: float64
Out[181]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE R_EXPENDITURE_INCOME R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS
0 -0.641653 -2.221179 0.272084 -1.901185 1.571900 -0.258144 -0.446605 -0.173849 0.192660 0.316520 ... 0.842411 0.504703 -1.773710 -1.087520 1.558246 0.243561 -0.555788 -0.457373 -11.91038 -0.420084
1 -0.054784 -0.586871 -0.027286 -0.704448 -0.057857 0.518035 0.287534 -1.022509 -1.847030 0.826975 ... -1.939268 -0.082071 0.679491 -0.166360 -0.740052 0.243561 -0.555788 -0.457373 0.08396 -0.420084
2 -0.685076 -1.192847 0.275012 -1.071609 1.684639 1.719951 -0.697706 -0.219659 1.222678 -0.391910 ... 0.718195 0.504703 1.268266 -1.122332 1.558246 0.243561 -0.555788 -0.457373 0.08396 -0.420084
3 -0.018782 -0.758162 0.376122 -0.986246 0.634056 1.224568 0.416664 0.597256 0.350766 1.040342 ... -0.176409 0.504703 1.155687 -0.652750 1.558246 0.243561 -0.555788 -0.457373 0.08396 -0.420084
4 0.547953 1.448408 1.381167 1.108251 1.571900 0.126270 -0.421101 -0.647558 -1.237920 -1.628593 ... -0.654451 -0.082071 -1.084406 -1.129809 1.558246 0.243561 1.799247 2.186400 0.08396 2.380476
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1.466869 1.714589 1.552338 0.528485 1.635792 0.577909 1.236307 1.327408 0.643664 0.119182 ... 0.084570 -0.082071 -0.684209 -1.148612 1.558246 0.243561 1.799247 2.186400 0.08396 2.380476
996 -0.011228 -0.600640 0.426788 -0.766068 0.717376 1.073535 0.230975 0.205494 -0.408395 0.594839 ... -1.096107 -0.082071 0.763783 -0.772383 -0.740052 0.243561 -0.555788 -0.457373 0.08396 -0.420084
997 -2.309647 -0.942606 -0.982664 0.206141 0.691980 -0.579489 -2.111023 -1.979688 1.793215 -2.186079 ... 2.887578 0.877940 -0.505927 0.299522 -0.740052 0.243561 -0.555788 -0.457373 0.08396 -0.420084
998 -0.593478 -1.478950 0.350547 -1.579451 1.640142 2.141093 -0.416883 -0.023856 0.775066 0.268842 ... 1.726977 1.102285 1.877274 -1.060494 -0.740052 0.243561 1.799247 -0.457373 0.08396 -0.420084
999 -0.461468 0.190279 -0.866535 0.994166 -1.079319 -1.199065 -0.544323 -0.026735 1.475732 -0.468859 ... -0.483811 1.102285 -0.880463 1.460234 -0.740052 0.243561 -0.555788 -0.457373 0.08396 -0.420084

1000 rows × 84 columns

Class Balance of DEFAULT variable¶

In [182]:
# Count DEFAULT values
default_counts = df['DEFAULT'].value_counts().sort_index()

# Plot
plt.figure(figsize=(6, 4))
sns.barplot(
    x=default_counts.index,
    y=default_counts.values,
    palette=['steelblue', 'tomato']
)

# Labels and title
plt.xlabel("DEFAULT (0 = Non-Default, 1 = Default)")
plt.ylabel("Number of Samples")
plt.title("Class Balance of DEFAULT Variable")

# Add percentage labels on bars
total = default_counts.sum()
for i, count in enumerate(default_counts.values):
    percentage = count / total * 100
    plt.text(i, count, f"{percentage:.1f}%", ha='center', va='bottom')

plt.tight_layout()
plt.show()
/tmp/ipykernel_4647/1663573361.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(

6. Prediction Task : Credit Default Classification¶

Problem Definition

The objective of this project is to predict the binary target variable DEFAULT, which indicates whether a client defaults on their credit repayment.

Formally, the task is a supervised binary classification problem defined as:

$$ y_i \in \{0,1\}, \quad i = 1, \dots, n $$

where:

  • $y_i = 1$ denotes a default customer,
  • $y_i = 0$ denotes a non-default customer,
  • $\mathbf{X}_i \in \mathbb{R}^p$ represents the feature vector describing client $i$ in the credit_score dataset.¶

6.1. Baseline Model: Logistic Regression¶

Logistic Regression was used as a baseline statistical model due to its interpretability and strong theoretical foundations.

The logistic regression model estimates the conditional probability:

$$ P(y = 1 \mid \mathbf{X}) = \sigma(\beta_0 + \mathbf{X}^\top \boldsymbol{\beta}) $$

where:

$$ \sigma(z) = \frac{1}{1 + e^{-z}}, $$

and $\boldsymbol{\beta} \in \mathbb{R}^p$ is the vector of regression coefficients.

The model parameters are estimated by minimizing the negative log-likelihood (binary cross-entropy loss):

$$ \mathcal{L}(\boldsymbol{\beta}) = - \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right], $$

where $p_i = P(y_i = 1 \mid \mathbf{X}_i)$.


In [183]:
# 1. Split Data for Classification 
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_scaled_df, y_classification, test_size=0.3, 
                                            random_state=42, stratify=y_classification)

# 2. Implement Baseline Model --- Logistic Regression---
log_reg         = LogisticRegression(class_weight='balanced', random_state=42, solver='liblinear')
log_reg.fit(X_train_c, y_train_c)

# 3. Evaluate Model
y_pred_prob_c   = log_reg.predict_proba(X_test_c)[:, 1]

# Calculate AUC-ROC
auc_roc         = roc_auc_score(y_test_c, y_pred_prob_c)
print("\n--- Classification Results (Logistic Regression) ---")
print(f"AUC-ROC Score on Test Set: {auc_roc:.4f}")

# Get feature importance scores
#importance_c = pd.Series(log_reg.feature_importances_, index=X_train_c.columns)

# Calculate other metrics
y_pred_c = log_reg.predict(X_test_c)
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c))
--- Classification Results (Logistic Regression) ---
AUC-ROC Score on Test Set: 0.6306

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.73      0.76       215
           1       0.44      0.54      0.48        85

    accuracy                           0.67       300
   macro avg       0.62      0.63      0.62       300
weighted avg       0.70      0.67      0.68       300

image-2.png

--- Interpreting Baseline Classification Results ---

Class 0(Non-Default)

  • Precision = 0.79: means that Of all customers the model predicted would not default, 79% were correct. This means 21% of those predicted as safe were actually defaulters (False Negatives).

  • Recall = 0.73: means that Of all customers who actually did not default, 73% were correctly identified. This means the model incorrectly flagged 27% of truly safe customers as defaulters (False Positives).

  • F1-Score = 0.76: A moderate score, demonstrating fair but not strong performance on the majority class.

Class 1(Default/The Target Risk Class)

  • Precision = 0.43 Low: Of all customers the model predicted would default, only 43% were correct. This means the model generates a high number of False Alarms (57% of those flagged were actually safe).

  • Recall = 0.52 Moderate: Of all customers who actually defaulted, 52% were correctly identified (or caught). This means 48% of true defaulters (high-risk individuals) were missed by the model (False Negatives).

  • F1-Score = 0.47: A weak balance between catching defaulters(1) and minimizing false alarms.

Overall Performance --> The low AUC =0.6256 confirms that this model needs to be replaced.

In [184]:
# 1. Calculate ROC curve components (False Positive Rate and True Positive Rate)
fpr, tpr, thresholds = roc_curve(y_test_c, y_pred_prob_c)

# 2. Calculate the Area Under the Curve (AUC)
roc_auc              = auc(fpr, tpr)

# 3. Plot the ROC Curve
plt.figure(figsize=(8, 6))

# Plot the ROC curve for the model
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'RF ROC curve (AUC = {roc_auc:.4f})')

# Plot the diagonal line 
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Chance (AUC = 0.50)')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) / Recall')
plt.title('Receiver Operating Characteristic (ROC) Curve - Baseline Classification')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

Get Logistic Regression coefficients as feature importance and Plot¶

In [185]:
importance_log = pd.Series(
    log_reg.coef_[0],
    index=X_train_c.columns
).sort_values(key=abs, ascending=False)
print(importance_log)

plt.figure(figsize=(8, 6))
importance_log.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Logistic Regression Feature Importance (Top 15)')
plt.xlabel('Coefficient Value')
plt.grid(alpha=0.3)
plt.show()
R_GROCERIES_INCOME     0.786413
R_HEALTH_INCOME        0.681634
R_EDUCATION_SAVINGS    0.671410
T_TRAVEL_12           -0.667495
R_HEALTH_SAVINGS      -0.571015
                         ...   
R_SAVINGS_INCOME       0.020393
T_HEALTH_12           -0.017926
CAT_DEPENDENTS        -0.016814
R_CLOTHING_SAVINGS    -0.012167
DEBT                   0.002508
Length: 84, dtype: float64
Interpretation of Logistic Regression Feature Importance Above¶
  • Risk-increasing indicators (positive coefficients)

    • R_GROCERIES_INCOME (+0.77): A higher ratio of grocery expenditure to income is strongly associated with increased default risk, suggesting that individuals allocating a large portion of their income to essential expenses are more financially constrained.

    • R_HEALTH_INCOME (+0.70) and R_EDUCATION_SAVINGS (+0.68): Higher spending relative to income or savings in these categories also increases default probability.

  • Risk-reducing indicators (negative coefficients)

    • T_TRAVEL_12 (-0.70): Borrowers with discretionary travel expenditures are less likely to default, suggesting greater financial stability.

    • R_HEALTH_SAVINGS (-0.59): Higher health-related savings reduce the likelihood of default, providing a financial buffer against unexpected expenses.

  • Low-impact features (near-zero coefficients)

    Several features, including entertainment spending, debt level, and gambling expenditure, have coefficients near zero, indicating minimal contribution to default prediction when considered alongside other variables.

Overall, the results indicate that high essential expenditure ratios are strong predictors of default risk, whereas higher savings and discretionary spending signal financial stability.


6.2. Random Forest Classifier¶

Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive performance and reduce overfitting. Unlike a single decision tree, which is prone to overfitting, Random Forest constructs a collection of trees, each trained on a bootstrap sample of the data and a random subset of features. This randomness increases model diversity and enhances generalization performance.

In the classification setting, each decision tree produces a class prediction or a class probability. The final prediction is obtained by aggregating predictions across all trees using majority voting or probability averaging. The model can be expressed as:

$$ \hat{p}(y=1 \mid X) = \frac{1}{M}\sum_{m=1}^M f_m(X) $$$$ \hat{y} = \mathbb{I}\bigl(\hat{p}(y=1 \mid X) \ge \tau \bigr) $$

where:

  • $\hat{y}$ is the predicted class label (DEFAULT in this study),
  • $M$ is the number of decision trees in the forest,
  • $f_m(X)$ is the predicted probability produced by the $m^{\text{th}}$ decision tree,
  • $X$ is the input feature vector,
  • $\tau$ is the classification threshold.

Random Forest models effectively capture non-linear interactions and are robust to noise and multicollinearity, which are common characteristics of financial datasets.


In [186]:
# Train a Random Forest Classification
rf_c         = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_c.fit(X_train_c, y_train_c)

# Get feature importance scores
importance_c = pd.Series(rf_c.feature_importances_, index=X_train_c.columns)

# Y prediction classification random forest
y_pred_rf       = rf_c.predict(X_test_c)
y_pred_proba_rf = rf_c.predict_proba(X_test_c)[:, 1] 

# Accuracy Score
accuracy        = accuracy_score(y_test_c, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy:.4f}")

# AUC-ROC Score 
auc_roc         = roc_auc_score(y_test_c, y_pred_proba_rf)
print(f"Random Forest AUC-ROC Score: {auc_roc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_rf))
Random Forest Accuracy: 0.7067
Random Forest AUC-ROC Score: 0.6148

Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.95      0.82       215
           1       0.42      0.09      0.15        85

    accuracy                           0.71       300
   macro avg       0.57      0.52      0.49       300
weighted avg       0.64      0.71      0.63       300

In [187]:
# 1. Calculate ROC curve components
fpr, tpr, thresholds = roc_curve(y_test_c, y_pred_proba_rf)

# 2. Calculate the Area Under the Curve (AUC)
roc_auc              = auc(fpr, tpr)

# 3. Plot the ROC Curve
plt.figure(figsize=(8, 6))

# Plot the ROC curve for the model
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'RF ROC curve (AUC = {roc_auc:.4f})')

# Plot the diagonal line
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Chance (AUC = 0.50)')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) / Recall')
plt.title('Receiver Operating Characteristic (ROC) Curve - Random Forest')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

Top 10 Features From Random Forest for Predicting DEFAULT¶

In [188]:
def plot_feature_importance(importance_series, title, top_n=10):
    """Plots the top N features based on importance scores"""
    plt.figure(figsize=(10,6))
    top_features = importance_series.nlargest(top_n)
    sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')
    plt.title(title)
    plt.xlabel('Feature Importance Score')
    plt.ylabel('Features')
    plt.show()

# Plot the top 10 most important features
plot_feature_importance(importance_c, 'Top 10 Features From Random Forest for Predicting DEFAULT', top_n=10)
/tmp/ipykernel_4647/2634937592.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')

6.3. XGBoost Classifier¶

Extreme Gradient Boosting (XGBoost) is an ensemble learning algorithm based on gradient boosting decision trees. Unlike bagging-based methods such as Random Forest, XGBoost builds decision trees sequentially, where each new tree aims to correct the prediction errors of the previous ensemble. This boosting strategy enables high predictive accuracy, particularly in complex and imbalanced datasets.

For binary classification, XGBoost estimates the probability of the positive class as an additive combination of decision trees:

$$ \hat{p}(y=1 \mid X) = \sigma\left(\sum_{m=1}^{M} f_m(X)\right) $$

where $\sigma(\cdot)$ denotes the logistic (sigmoid) function.
The final class prediction is obtained using a fixed decision threshold:

$$ \hat{y} = \mathbb{I}\bigl(\hat{p}(y=1 \mid X) \ge 0.5 \bigr) $$

Here, $f_m(X)$ represents the contribution of the $m^{\text{th}}$ decision tree, $M$ is the total number of trees, and $X$ is the input feature vector.

XGBoost optimizes a regularized objective function that combines a differentiable loss function with a complexity penalty on the trees:

$$ \mathcal{L} = \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), $$

with:

$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2, $$

where $\ell$ denotes the logistic loss, $T$ is the number of leaves, $w_j$ are the leaf weights, and $\gamma$ and $\lambda$ are regularization parameters.

XGBoost further incorporates shrinkage and structural constraints (e.g., learning rate and maximum tree depth) to control model complexity and reduce overfitting, making it well-suited for credit risk classification tasks involving non-linear feature interactions.


In [189]:
# Train XGBoost Classification with optimal parameters
xgb_c = XGBClassifier(
    n_estimators    = 100
    , max_depth     = 6
    , learning_rate = 0.1
    , random_state  = 42
    , eval_metric   = 'logloss'
)

# Fit the model
xgb_c.fit(X_train_c.values, y_train_c.values)

# Make predictions
y_pred_xgb       = xgb_c.predict(X_test_c.values)
y_pred_proba_xgb = xgb_c.predict_proba(X_test_c.values)[:, 1]

print("XGBoost Classification Model trained successfully!")
XGBoost Classification Model trained successfully!
In [190]:
# Calculate metrics
accuracy_xgb    = accuracy_score(y_test_c, y_pred_xgb)
precision_xgb   = precision_score(y_test_c, y_pred_xgb)
recall_xgb      = recall_score(y_test_c, y_pred_xgb)
f1_xgb          = f1_score(y_test_c, y_pred_xgb)
roc_auc_xgb     = roc_auc_score(y_test_c, y_pred_proba_xgb)

print("=== XGBoost Classification Metrics ===")
print(f"Accuracy:  {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall:    {recall_xgb:.4f}")
print(f"F1-Score:  {f1_xgb:.4f}")
print(f"ROC-AUC:   {roc_auc_xgb:.4f}")
print("\n=== Classification Report ===")
print(classification_report(y_test_c, y_pred_xgb))
print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_test_c, y_pred_xgb))
=== XGBoost Classification Metrics ===
Accuracy:  0.7033
Precision: 0.4500
Recall:    0.2118
F1-Score:  0.2880
ROC-AUC:   0.6070

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.74      0.90      0.81       215
           1       0.45      0.21      0.29        85

    accuracy                           0.70       300
   macro avg       0.60      0.55      0.55       300
weighted avg       0.66      0.70      0.66       300


=== Confusion Matrix ===
[[193  22]
 [ 67  18]]

Top 10 Features From XGBoost for Predicting DEFAULT¶

In [191]:
# XGBoost Feature Importance
importance_xgb_c = pd.Series(xgb_c.feature_importances_, index=X_train_c.columns)

# Plot feature importance
plot_feature_importance(importance_xgb_c, 'Top 10 Features From XGBoost for Predicting DEFAULT', top_n=10)
/tmp/ipykernel_4647/2634937592.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')
In [192]:
fpr_xgb, tpr_xgb, thresholds_xgb    = roc_curve(y_test_c, y_pred_proba_xgb)
roc_auc_xgb_curve                   = auc(fpr_xgb, tpr_xgb)

# Plot ROC Curve
plt.figure(figsize=(10, 6))
plt.plot(fpr_xgb, tpr_xgb, color='darkorange', lw=2, 
         label=f'XGBoost ROC curve (AUC = {roc_auc_xgb_curve:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - XGBoost Classification')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()

6.4. Model Comparison: Logistic Regression, Random Forest vs XGBoost (Classification)¶

The performance of the three classification models: Logistic Regression, Random Forest, and XGBoost was evaluated on the test dataset using Accuracy, Precision, Recall, F1-score, and ROC-AUC metrics for default customer (Class 1). The results are summarized in Table Classification Model Comparison below.

In [193]:
# -------- Logistic Regression metrics --------
y_pred_log        = log_reg.predict(X_test_c)
y_pred_proba_log  = log_reg.predict_proba(X_test_c)[:, 1]

accuracy_log  = accuracy_score(y_test_c, y_pred_log)
precision_log = precision_score(y_test_c, y_pred_log)
recall_log    = recall_score(y_test_c, y_pred_log)
f1_log        = f1_score(y_test_c, y_pred_log)
roc_auc_log   = roc_auc_score(y_test_c, y_pred_proba_log)

# ROC curve for Logistic Regression
fpr_log, tpr_log, _ = roc_curve(y_test_c, y_pred_proba_log)


# -------- Random Forest metrics --------
y_pred_rf        = rf_c.predict(X_test_c)
y_pred_proba_rf  = rf_c.predict_proba(X_test_c)[:, 1]

accuracy_rf  = accuracy_score(y_test_c, y_pred_rf)
precision_rf = precision_score(y_test_c, y_pred_rf)
recall_rf    = recall_score(y_test_c, y_pred_rf)
f1_rf        = f1_score(y_test_c, y_pred_rf)
roc_auc_rf   = roc_auc_score(y_test_c, y_pred_proba_rf)

# ROC curve for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test_c, y_pred_proba_rf)

# ROC curve for XGBoost
fpr_xgb, tpr_xgb, _ = roc_curve(y_test_c, y_pred_proba_xgb)

# -------- Create comparison dataframe --------
comparison_c = pd.DataFrame({
    'Model'     : ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy'  : [accuracy_log, accuracy_rf, accuracy_xgb],
    'Precision' : [precision_log, precision_rf, precision_xgb],
    'Recall'    : [recall_log, recall_rf, recall_xgb],
    'F1-Score'  : [f1_log, f1_rf, f1_xgb],
    'ROC-AUC'   : [roc_auc_log, roc_auc_rf, roc_auc_xgb]
})

print("=== Classification Model Comparison ===")
print(comparison_c.to_string(index=False))


fig, axes = plt.subplots(1, 2, figsize=(18, 6))
def add_bar_labels(ax):
    for container in ax.containers:
        ax.bar_label(container, fmt='%.3f', label_type='edge', padding=10)

# --- Bar Plot: Metrics Comparison ---
comparison_c.set_index('Model').plot(
    kind='bar',
    ax=axes[0],
    rot=0
)

axes[0].set_title('Classification Metrics Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].set_ylim([0, 1.0])
axes[0].legend(loc='lower right')
axes[0].grid(axis='y', alpha=0.3)

add_bar_labels(axes[0])

# --- ROC Curves Comparison ---
axes[1].plot(fpr_log, tpr_log, lw=2,
            label=f'Logistic Regression (AUC = {roc_auc_log:.4f})')

axes[1].plot(fpr_rf, tpr_rf, lw=2,
            label=f'Random Forest (AUC = {roc_auc_rf:.4f})')

axes[1].plot(fpr_xgb, tpr_xgb, lw=2,
            label=f'XGBoost (AUC = {roc_auc_xgb:.4f})')

axes[1].plot([0, 1], [0, 1], 'k--', lw=1.5, label='Random Classifier')

axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
=== Classification Model Comparison ===
              Model  Accuracy  Precision   Recall  F1-Score  ROC-AUC
Logistic Regression  0.673333   0.438095 0.541176  0.484211 0.630588
      Random Forest  0.706667   0.421053 0.094118  0.153846 0.614802
            XGBoost  0.703333   0.450000 0.211765  0.288000 0.607004

Logistic Regression demonstrates the highest recall (0.517) and F1-score (0.468) among the three models, indicating a better balance between identifying default cases and maintaining classification precision. This suggests that the baseline model is effective at capturing the underlying linear relationships in the dataset and provides stable performance despite its simplicity.

Random Forest achieves higher overall accuracy (0.706) compared to Logistic Regression. However, it exhibits a very low recall (0.094), indicating poor detection of default cases. This behavior suggests that the model is biased toward the majority class, which limits its usefulness in credit risk classification where identifying defaulters is critical.

XGBoost attains the highest accuracy (0.713) and precision (0.486), reflecting stronger performance in correctly identifying non-default cases. Nevertheless, its recall (0.211) remains substantially lower than that of Logistic Regression, indicating that many default instances are still missed. While XGBoost captures complex non-linear patterns, its conservative predictions reduce its effectiveness in detecting high-risk borrowers.


7: Regression Task: Predicting CREDIT_SCORE¶

7.1. Target Transformation for Improved Regression Performance¶

Why transform CREDIT_SCORE?

  • Kolmogorov-Smirnov test showed CREDIT_SCORE is non-normal (p=0.0003)
  • Linear models assume normally distributed residuals
  • Transformation can normalize the distribution and improve R²

Although tree-based models do not require normality, linear regression performance is often improved when the target (and consequently residuals) are closer to a symmetric distribution. Transformation can stabilize variance and reduce the impact of skewness, which helps satisfy the underlying assumptions of linear regression.

We applied the Yeo–Johnson transformation because it is more flexible than Box–Cox: it supports both positive and non-positive values. The transformation is defined as:

$$ y^{(\lambda)} = \begin{cases} \dfrac{(y + 1)^\lambda - 1}{\lambda}, & \text{if } y \ge 0, \ \lambda \ne 0,\\[6pt] \log(y+1), & \text{if } y \ge 0, \ \lambda = 0,\\[6pt] -\dfrac{(-y + 1)^{2-\lambda}-1}{2-\lambda}, & \text{if } y < 0, \ \lambda \ne 2,\\[6pt] -\log(-y+1), & \text{if } y < 0, \ \lambda = 2. \end{cases} $$

In our experiment, the transformed target becomes standardized:

  • Original CREDIT_SCORE: mean $= 588.36$, std $= 58.09$
  • Transformed target: mean $\approx 0.00$, std $\approx 1.00$

In [194]:
pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Fit and transform CREDIT_SCORE
y_regression_transformed        = pt.fit_transform(y_regression.values.reshape(-1, 1)).ravel()

print(f"Original CREDIT_SCORE - Mean: {y_regression.mean():.2f}, Std: {y_regression.std():.2f}")
print(f"Transformed CREDIT_SCORE - Mean: {y_regression_transformed.mean():.2f}, Std: {y_regression_transformed.std():.2f}")
Original CREDIT_SCORE - Mean: 588.36, Std: 58.09
Transformed CREDIT_SCORE - Mean: 0.00, Std: 1.00

7.2. Baseline Model: Linear Regression¶

As a baseline, we used Ordinary Least Squares (OLS) Linear Regression due to its interpretability and strong theoretical foundation.
The goal is to estimate a linear relationship between predictors and credit score, providing a transparent benchmark that is easy to explain.

The model assumes:

$$ \hat{y}_i = \beta_0 + \mathbf{X}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad \text{with } \mathbb{E}[\varepsilon_i] = 0 $$

The parameters are found by minimizing the sum of squared residuals:

$$ \min_{\beta_0,\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{X}_i^\top \boldsymbol{\beta} \right)^2 $$

Linear regression provides two advantages in this context:

  • Interpretability: coefficients indicate the direction and relative strength of feature effects.
  • Strong baseline: it reveals how much variance can be explained using only linear relationships.

In [195]:
# 1. Split Data for Regression
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_scaled_df, y_regression, test_size=0.3, random_state=42)

# 2. Implement Baseline Model ---Linear Regression---
lin_reg   = LinearRegression()
lin_reg.fit(X_train_r, y_train_r)

# 3. Evaluate Model
y_pred_r  = lin_reg.predict(X_test_r)

# Calculate Evalution Metrics 
rmse      = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
r2           = r2_score(y_test_r, y_pred_r)
mean_abs_err = np.mean(np.abs(y_test_r - y_pred_r))

print("\n--- Regression Results (Linear Regression) ---")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R^2): {r2:.4f}")
print(f"Mean Absolute Error (MAE): {mean_abs_err:.2f}")
--- Regression Results (Linear Regression) ---
Root Mean Squared Error (RMSE): 29.90
R-squared (R^2): 0.7432
Mean Absolute Error (MAE): 21.80

These results show that a significant portion of credit score variability is already captured by linear combinations of standardized financial ratios and spending indicators.

7.3. Random Forest Regressor¶

Random Forest Regression is an ensemble method designed to capture non-linear patterns by combining many decision trees.
Each tree is trained on a bootstrap sample of the training set, and at each split only a random subset of features is considered. This injected randomness reduces correlation between trees and improves generalization.

The prediction is obtained by averaging the outputs of $M$ trees:

$$ \hat{y}(X) = \frac{1}{M}\sum_{m=1}^M f_m(X) $$
In [196]:
# Train feature importance 
rf_r          = RandomForestRegressor(n_estimators=100, random_state=42)
rf_r.fit(X_train_r, y_train_r)

# Get feature importance scores
importance_r = pd.Series(rf_r.feature_importances_, index=X_train_r.columns)

# Evaluate Model
y_pred_regrf = rf_r.predict(X_test_r)

# Calculate Evaluation Metrics
rmse_rf = np.sqrt(mean_squared_error(y_test_r, y_pred_regrf))
r2_rf   = r2_score(y_test_r, y_pred_regrf)
mae_rf  = np.mean(np.abs(y_test_r - y_pred_regrf))

print("--- Regression Result after Applying Random Forest ---")
print(f"Root Mean Squared Error (RMSE): {rmse_rf:.2f}")
print(f"R-squared (R^2): {r2_rf:.4f}")
print(f"Mean Absolute Error (MAE): {mae_rf:.2f}")
--- Regression Result after Applying Random Forest ---
Root Mean Squared Error (RMSE): 28.07
R-squared (R^2): 0.7736
Mean Absolute Error (MAE): 20.59

Random Forest achieved the best regression performance, improving both error metrics and explained variance compared to the linear baseline.

Bar plot shows the top 10 importants features for Predicting Y=CREDITSCORE¶

In [197]:
# Plot the top 10 most important features
plot_feature_importance(importance_r, 'Top 10 Features for Predicting CREDIT_SCORE', top_n=10)
/tmp/ipykernel_4647/2634937592.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')

The feature importance plot shows that R_DEBT_INCOME dominates prediction. This is consistent with financial intuition: clients with high debt relative to income tend to have weaker repayment capacity and thus lower creditworthiness.


7.4. XGBoost Regressor¶

XGBoost (Extreme Gradient Boosting) is a boosting-based ensemble method that builds decision trees sequentially.
Unlike Random Forest (bagging), boosting focuses on correcting prediction errors iteratively: each new tree is trained to reduce the residuals of the previous model.

The model prediction is:

$$ \hat{y}(X) = \sum_{m=1}^{M} f_m(X) $$

XGBoost minimizes a regularized objective function that balances fit and complexity:

$$ \mathcal{L} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 $$

where $T$ is the number of leaves and $w_j$ are the leaf weights.
This regularization is important in credit modeling because it reduces overfitting and encourages simpler tree structures.

In [198]:
# Train XGBoost Regression with optimal parameters
xgb_r = XGBRegressor(
    n_estimators    = 100
    , max_depth     = 6
    , learning_rate = 0.1
    , random_state  = 42
    , objective     = 'reg:squarederror'
)

# Fit the model
xgb_r.fit(X_train_r.values, y_train_r.values)

# Make predictions
y_pred_xgb_r = xgb_r.predict(X_test_r.values)

print("XGBoost Regression Model trained successfully!")
XGBoost Regression Model trained successfully!
In [199]:
# Calculate metrics
mae_xgb     = mean_absolute_error(y_test_r, y_pred_xgb_r)
mse_xgb     = mean_squared_error(y_test_r, y_pred_xgb_r)
rmse_xgb    = np.sqrt(mse_xgb)
r2_xgb      = r2_score(y_test_r, y_pred_xgb_r)

print("=== XGBoost Regression Metrics ===")
print(f"Mean Absolute Error (MAE):  {mae_xgb:.4f}")
print(f"Mean Squared Error (MSE):   {mse_xgb:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb:.4f}")
print(f"R² Score:                   {r2_xgb:.4f}")
=== XGBoost Regression Metrics ===
Mean Absolute Error (MAE):  21.6811
Mean Squared Error (MSE):   871.8902
Root Mean Squared Error (RMSE): 29.5278
R² Score:                   0.7495

XGBoost performs similarly to the linear baseline but slightly worse than Random Forest. A likely explanation is that the dataset size (1000) and parameter choices (learning rate, depth) may limit the advantage of boosting, while Random Forest benefits more from variance reduction.

Top 10 Features From XGBoost for Predicting CREDIT_SCORE¶

In [200]:
# XGBoost Feature Importance for Regression
importance_xgb_r = pd.Series(xgb_r.feature_importances_, index=X_train_r.columns)

# Plot feature importance
plot_feature_importance(importance_xgb_r, 'Top 10 Features From XGBoost for Predicting CREDIT_SCORE', top_n=10)
/tmp/ipykernel_4647/2634937592.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')

XGBoost also emphasizes R_DEBT_INCOME as the primary driver, followed by expenditure-related ratios, reinforcing that relative financial burden is key for credit score prediction.

Model Diagnostics: XGBoost Predictions and Residuals¶

To verify model behavior beyond metrics, we analyzed diagnostic plots: predicted vs actual values and residual distributions.
A good regression model should produce points close to the $y=x$ line (perfect prediction) and residuals should be centered around zero without strong structure.

In [201]:
# Visualize Predictions vs Actual for XGBoost
plt.figure(figsize=(10, 6))
plt.scatter(y_test_r, y_pred_xgb_r, alpha=0.6, edgecolors='k')
plt.plot([y_test_r.min(), y_test_r.max()], [y_test_r.min(), y_test_r.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual CREDIT_SCORE')
plt.ylabel('Predicted CREDIT_SCORE')
plt.title('XGBoost: Predicted vs Actual CREDIT_SCORE')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Residual plot
residuals_xgb = y_test_r - y_pred_xgb_r
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb_r, residuals_xgb, alpha=0.6, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted CREDIT_SCORE')
plt.ylabel('Residuals')
plt.title('XGBoost: Residual Plot')
plt.grid(alpha=0.3)
plt.show()

The plots indicate that predictions generally follow the perfect prediction line, but dispersion increases for extreme credit scores.
The residual plot is mostly centered around zero, but a few large residuals appear, showing that some clients have atypical profiles not well captured by the model (potential outliers or rare spending/debt patterns).


7.5 Model Comparison: Random Forest vs XGBoost (Regression)¶

We compared Random Forest and XGBoost using standard regression metrics:

  • MAE (Mean Absolute Error): average absolute deviation,
  • RMSE (Root Mean Squared Error): penalizes large errors more strongly,
  • R^2: fraction of variance explained by the model.
In [202]:
# Get Random Forest regression metrics 
y_pred_rf_r = rf_r.predict(X_test_r)

mae_rf  = mean_absolute_error(y_test_r, y_pred_rf_r)
mse_rf  = mean_squared_error(y_test_r, y_pred_rf_r)
rmse_rf = np.sqrt(mse_rf)
r2_rf   = r2_score(y_test_r, y_pred_rf_r)

# Create comparison dataframe
comparison_r = pd.DataFrame({
    'Model'   : ['Random Forest', 'XGBoost']
    , 'MAE'     : [mae_rf, mae_xgb]
    , 'MSE'     : [mse_rf, mse_xgb]
    , 'RMSE'    : [rmse_rf, rmse_xgb]
    , 'R² Score': [r2_rf, r2_xgb]
})

print("=== Regression Model Comparison ===")
print(comparison_r.to_string(index=False))

fig, axes       = plt.subplots(1, 2, figsize=(16, 5))

error_metrics   = comparison_r[['Model', 'MAE', 'RMSE']].set_index('Model')
error_metrics.plot(kind='bar', ax=axes[0], rot=0, color=['skyblue', 'lightcoral'])
axes[0].set_title('Regression Error Metrics Comparison (Lower is Better)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Error Value')
axes[0].legend(loc='upper right')
axes[0].grid(axis='y', alpha=0.3)

r2_comparison = comparison_r[['Model', 'R² Score']].set_index('Model')
r2_comparison.plot(kind='bar', ax=axes[1], rot=0, color='lightgreen', legend=False)
axes[1].set_title('R² Score Comparison (Higher is Better)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1.0])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

best_model_r2   = comparison_r.loc[comparison_r['R² Score'].idxmax(), 'Model']
best_model_mae  = comparison_r.loc[comparison_r['MAE'].idxmin(), 'Model']

print(f"\nBest model by R² Score: {best_model_r2}")
print(f"Best model by MAE: {best_model_mae}")
=== Regression Model Comparison ===
        Model       MAE        MSE      RMSE  R² Score
Random Forest 20.585950 788.099835 28.073116  0.773618
      XGBoost 21.681101 871.890225 29.527787  0.749549
Best model by R² Score: Random Forest
Best model by MAE: Random Forest

Interpretation: Random Forest outperforms XGBoost across all metrics. This indicates that averaging many trees is particularly effective for this dataset, likely because it reduces variance and captures non-linearities without requiring careful boosting hyperparameter tuning.


8. Using PCA dataframe to fit a model¶

To reduce feature dimensionality and mitigate multicollinearity, Principal Component Analysis (PCA) was applied prior to classification.

The models were trained and evaluated using the first 18 principal components derived from the dataset.

Data Preparation using PCA Features¶

In [203]:
# Call PCA dataframe that we have created above to work. 
df_pca.head()
Out[203]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC13 PC14 PC15 PC16 PC17 PC18 DEFAULT CREDIT_SCORE CATEGORY Cluster
0 -1.263568 1.675703 1.616015 -1.777544 0.900540 0.361975 -2.571286 -0.326115 0.336299 0.484445 ... -0.392118 -0.575822 0.026258 1.003179 -0.838064 0.592423 1 444.0 Poor 2
1 -1.113713 0.097301 -0.083219 -1.488284 -0.278663 -1.442984 1.640455 -2.546311 -2.071634 1.182105 ... 0.011836 -1.387963 -1.021540 0.263893 -0.475882 0.268768 0 625.0 Good 2
2 0.244502 6.755585 -0.368538 1.002296 4.414212 1.369341 -3.355328 0.337632 -1.563512 0.385298 ... -0.693706 -0.040053 0.396325 0.708798 0.652537 1.179665 1 469.0 Poor 5
3 2.228464 5.653206 2.395674 0.698185 -1.647916 0.834930 0.376260 -0.190960 0.288253 1.457348 ... 0.668575 -0.342176 0.983585 -1.665101 -0.285464 0.304583 0 559.0 Fair 3
4 1.300493 -2.019472 -2.782794 0.626834 -0.635844 -2.531894 -2.460998 -1.658941 0.163559 0.804533 ... 0.124539 0.018004 0.481896 -0.458401 -0.139912 -0.680096 0 473.0 Poor 0

5 rows × 22 columns

In [204]:
# Features (X_pca): Use PC1 through PC10
X_pca = df_pca[['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10'
                , 'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17', 'PC18']]

# Targets
y_classification = df_pca['DEFAULT']
y_regression     = df_pca['CREDIT_SCORE']

print(f"PCA Feature set shape: {X_pca.shape}")
PCA Feature set shape: (1000, 18)

8.1 Logistic Regression Predicting DEFAULT (Using PCA)¶

In [205]:
X_train_pca_c, X_test_pca_c, y_train_c, y_test_c = train_test_split(X_pca, y_classification, 
                                test_size=0.3, random_state=42, stratify=y_classification)

log_reg_pca        = LogisticRegression(class_weight='balanced', random_state=42, solver='liblinear')
log_reg_pca.fit(X_train_pca_c, y_train_c)

# Predict probabilities 
y_pred_proba_pca_c = log_reg_pca.predict_proba(X_test_pca_c)[:, 1]

auc_roc_pca        = roc_auc_score(y_test_c, y_pred_proba_pca_c)

print("\n--- CLASSIFICATION Results (Logistic Regression on PCA Features) ---")
print(f"AUC-ROC Score on Test Set (PCA): {auc_roc_pca:.4f}")

# Classification Report
y_pred_pca_c = log_reg_pca.predict(X_test_pca_c)
print("\nClassification Report (PCA):")
print(classification_report(y_test_c, y_pred_pca_c))
--- CLASSIFICATION Results (Logistic Regression on PCA Features) ---
AUC-ROC Score on Test Set (PCA): 0.6546

Classification Report (PCA):
              precision    recall  f1-score   support

           0       0.78      0.64      0.71       215
           1       0.38      0.55      0.45        85

    accuracy                           0.62       300
   macro avg       0.58      0.60      0.58       300
weighted avg       0.67      0.62      0.63       300

Feature Importance for CLASSIFICATION(DEFAULT) from PCA¶

image.png

Where P(x) = E[Y| X=x] = Probab(Y=1|X=x)

In [206]:
# The coefficients show the impact of each PC on the log-odds of defaulting
coefs_c = pd.Series(log_reg_pca.coef_[0], index=X_pca.columns).sort_values(ascending=False)
print("\nTop 5 PCA Factors for Default Risk:")
print(coefs_c)
Top 5 PCA Factors for Default Risk:
PC12    0.263729
PC18    0.096250
PC2     0.086298
PC17    0.078865
PC16    0.075884
PC1     0.069205
PC8     0.032825
PC5     0.015162
PC3     0.000561
PC10   -0.005295
PC15   -0.013606
PC13   -0.055558
PC9    -0.079374
PC11   -0.086603
PC14   -0.091244
PC4    -0.104644
PC6    -0.118635
PC7    -0.340028
dtype: float64
In [207]:
plt.figure(figsize=(10, 6))
coefs_c_sorted = coefs_c.sort_values(ascending=True)

sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire')

plt.title('Impact of PCA Factors on Default Risk ')
plt.xlabel('Coefficient Value (Impact on Log-Odds of Default)')
plt.ylabel('Principal Component Factor')
plt.axvline(0, color='red', linestyle='--', linewidth=2) # Baseline at zero

plt.text(coefs_c_sorted.max() * 0.95, -0.5, '↑ Increases Default Risk',
        fontsize=10, color='green', ha='right')
plt.text(coefs_c_sorted.min() * 0.95, -0.5, '↓ Decreases Default Risk',
        fontsize=10, color='blue', ha='left')

plt.tight_layout()
plt.show()
/tmp/ipykernel_4647/3790497150.py:4: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire')

Interpretation Classification Plot (Default Risk)

  • Positive Coeff: These PCs are positively correlated with the log-odds of defaulting. From the plot, we can see that PC1, PC9, PC10 are positive that means the factor for instance PC9:"Tax Behavior and Travel & Health Spending" significantly increases the likelihood of default(customer will default).

  • Negative Coefficients: These PCs are negatively correlated with default risk. We can see that PC3, PC4, etc are negative coefficients. They are "protective" factors or significantly decreases the likelihood that a customer will default.

8.2. Applying XGBoost Classification on PCA Data¶

In [208]:
# Train XGBoost Classification on PCA data with optimal parameters
xgb_pca_c = XGBClassifier(
    n_estimators    = 100
    , max_depth     = 6 
    , learning_rate = 0.1
    , random_state  = 42
    , eval_metric   = 'logloss'
)

# Fit the model
xgb_pca_c.fit(X_train_pca_c.values, y_train_c.values)

# Make predictions
y_pred_xgb_pca_c       = xgb_pca_c.predict(X_test_pca_c.values)
y_pred_proba_xgb_pca_c = xgb_pca_c.predict_proba(X_test_pca_c.values)[:, 1]

print("XGBoost Classification on PCA data trained successfully!")
XGBoost Classification on PCA data trained successfully!
In [209]:
# Classification Metrics for XGBoost on PCA
accuracy_xgb_pca_c  = accuracy_score(y_test_c, y_pred_xgb_pca_c)
precision_xgb_pca_c = precision_score(y_test_c, y_pred_xgb_pca_c)
recall_xgb_pca_c    = recall_score(y_test_c, y_pred_xgb_pca_c)
f1_xgb_pca_c        = f1_score(y_test_c, y_pred_xgb_pca_c)
roc_auc_xgb_pca_c   = roc_auc_score(y_test_c, y_pred_proba_xgb_pca_c)

print("=== XGBoost Classification Metrics (PCA Data) ===")
print(f"Accuracy:  {accuracy_xgb_pca_c:.4f}")
print(f"Precision: {precision_xgb_pca_c:.4f}")
print(f"Recall:    {recall_xgb_pca_c:.4f}")
print(f"F1-Score:  {f1_xgb_pca_c:.4f}")
print(f"ROC-AUC:   {roc_auc_xgb_pca_c:.4f}")
print("\n=== Classification Report ===")
print(classification_report(y_test_c, y_pred_xgb_pca_c))
print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_test_c, y_pred_xgb_pca_c))
=== XGBoost Classification Metrics (PCA Data) ===
Accuracy:  0.7100
Precision: 0.4762
Recall:    0.2353
F1-Score:  0.3150
ROC-AUC:   0.6161

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.75      0.90      0.82       215
           1       0.48      0.24      0.31        85

    accuracy                           0.71       300
   macro avg       0.61      0.57      0.57       300
weighted avg       0.67      0.71      0.67       300


=== Confusion Matrix ===
[[193  22]
 [ 65  20]]
In [210]:
# Feature Importance for PCA components
importance_xgb_pca_c = pd.Series(xgb_pca_c.feature_importances_, index=X_train_pca_c.columns)

plt.figure(figsize  = (10, 6))
importance_sorted   = importance_xgb_pca_c.sort_values(ascending=True)
sns.barplot(x=importance_sorted.values, y=importance_sorted.index, palette='viridis')
plt.title('XGBoost Feature Importance on PCA Components (Classification)')
plt.xlabel('Feature Importance Score')
plt.ylabel('PCA Components')
plt.show()
/tmp/ipykernel_4647/3022377628.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=importance_sorted.values, y=importance_sorted.index, palette='viridis')
In [211]:
# ROC Curve for XGBoost on PCA
fpr_xgb_pca_c, tpr_xgb_pca_c, _     = roc_curve(y_test_c, y_pred_proba_xgb_pca_c)
roc_auc_xgb_pca_c_curve             = auc(fpr_xgb_pca_c, tpr_xgb_pca_c)

plt.figure(figsize=(10, 6))
plt.plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2, 
         label=f'XGBoost on PCA ROC curve (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - XGBoost Classification on PCA Data')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()

8.3. Model Comparison: Logistic Regression vs XGBoost on PCA (Classification)¶

In [212]:
# Get Logistic Regression metrics on PCA
y_pred_logreg_pca       = log_reg_pca.predict(X_test_pca_c)
y_pred_proba_logreg_pca = log_reg_pca.predict_proba(X_test_pca_c)[:, 1]

accuracy_logreg_pca     = accuracy_score(y_test_c, y_pred_logreg_pca)
precision_logreg_pca    = precision_score(y_test_c, y_pred_logreg_pca)
recall_logreg_pca       = recall_score(y_test_c, y_pred_logreg_pca)
f1_logreg_pca           = f1_score(y_test_c, y_pred_logreg_pca)
roc_auc_logreg_pca      = roc_auc_score(y_test_c, y_pred_proba_logreg_pca)

# Create comparison dataframe
comparison_pca_c = pd.DataFrame({
    'Model'         : ['Logistic Regression', 'XGBoost']
    , 'Accuracy'    : [accuracy_logreg_pca, accuracy_xgb_pca_c]
    , 'Precision'   : [precision_logreg_pca, precision_xgb_pca_c]
    , 'Recall'      : [recall_logreg_pca, recall_xgb_pca_c]
    , 'F1-Score'    : [f1_logreg_pca, f1_xgb_pca_c]
    , 'ROC-AUC'     : [roc_auc_logreg_pca, roc_auc_xgb_pca_c]
})

print("=== PCA Classification Model Comparison ===")
print(comparison_pca_c.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Bar plot comparison
comparison_pca_c.set_index('Model').plot(kind='bar', ax=axes[0], rot=0)
axes[0].set_title('Classification Metrics Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].set_ylim([0, 1.0])
axes[0].legend(loc='lower right')
axes[0].grid(axis='y', alpha=0.3)

# ROC Curves comparison
fpr_logreg_pca, tpr_logreg_pca, _   = roc_curve(y_test_c, y_pred_proba_logreg_pca)
roc_auc_logreg_pca_curve            = auc(fpr_logreg_pca, tpr_logreg_pca)

axes[1].plot(fpr_logreg_pca, tpr_logreg_pca, color='blue', lw=2, 
             label=f'Logistic Regression (AUC = {roc_auc_logreg_pca_curve:.4f})')
axes[1].plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2, 
             label=f'XGBoost (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
=== PCA Classification Model Comparison ===
              Model  Accuracy  Precision   Recall  F1-Score  ROC-AUC
Logistic Regression  0.616667   0.379032 0.552941  0.449761 0.654555
            XGBoost  0.710000   0.476190 0.235294  0.314961 0.616142

Interpretation from PCA Classification Model Comparison Table

Logistic Regression achieves higher recall (0.541) and F1-score (0.440), indicating it is better at identifying default cases, despite having lower overall accuracy (0.61) compared to XGBoost. XGBoost shows higher accuracy (0.71) and precision (0.469) but misses many defaults, as reflected by its low recall (0.176) and F1-score (0.256).

Overall, for credit risk assessment on PCA-transformed data, Logistic Regression is preferable due to its better balance between detecting defaults and maintaining interpretability, whereas XGBoost prioritizes correctly classifying non-default cases.

In [213]:
# Get Logistic Regression metrics on PCA
y_pred_logreg_pca        = log_reg_pca.predict(X_test_pca_c)
y_pred_proba_logreg_pca  = log_reg_pca.predict_proba(X_test_pca_c)[:, 1]

accuracy_logreg_pca      = accuracy_score(y_test_c, y_pred_logreg_pca)
precision_logreg_pca     = precision_score(y_test_c, y_pred_logreg_pca)
recall_logreg_pca        = recall_score(y_test_c, y_pred_logreg_pca)
f1_logreg_pca            = f1_score(y_test_c, y_pred_logreg_pca)
roc_auc_logreg_pca       = roc_auc_score(y_test_c, y_pred_proba_logreg_pca)

# Create comparison dataframe
comparison_pca_c = pd.DataFrame({
    'Model'      : ['Logistic Regression', 'XGBoost'],
    'Accuracy'   : [accuracy_logreg_pca, accuracy_xgb_pca_c],
    'Precision'  : [precision_logreg_pca, precision_xgb_pca_c],
    'Recall'     : [recall_logreg_pca, recall_xgb_pca_c],
    'F1-Score'   : [f1_logreg_pca, f1_xgb_pca_c],
    'ROC-AUC'    : [roc_auc_logreg_pca, roc_auc_xgb_pca_c]
})

print("=== PCA Classification Model Comparison ===")
print(comparison_pca_c.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# -------------------------------
# Replace axes[0] with PCA coefficient impact plot
# -------------------------------
coefs_c_sorted = coefs_c.sort_values(ascending=True)

sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire', ax=axes[0])

axes[0].set_title('Impact of PCA Factors on Default Risk', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Coefficient Value (Impact on Log-Odds of Default)')
axes[0].set_ylabel('Principal Component Factor')
axes[0].axvline(0, color='red', linestyle='--', linewidth=2)  # Baseline at zero

# Middle y-position
y_middle = len(coefs_c_sorted) / 2  

axes[0].text(coefs_c_sorted.max() * 0.95, y_middle, '↑ Increases Default Risk',
            fontsize=10, color='green', ha='right', va='center')
axes[0].text(coefs_c_sorted.min() * 0.95, y_middle, '↓ Decreases Default Risk',
            fontsize=10, color='blue', ha='left', va='center')

# -------------------------------
# ROC Curves comparison
# -------------------------------
fpr_logreg_pca, tpr_logreg_pca, _  = roc_curve(y_test_c, y_pred_proba_logreg_pca)
roc_auc_logreg_pca_curve           = auc(fpr_logreg_pca, tpr_logreg_pca)

axes[1].plot(fpr_logreg_pca, tpr_logreg_pca, color='blue', lw=2, 
            label=f'Logistic Regression (AUC = {roc_auc_logreg_pca_curve:.4f})')
axes[1].plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2, 
            label=f'XGBoost (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
=== PCA Classification Model Comparison ===
              Model  Accuracy  Precision   Recall  F1-Score  ROC-AUC
Logistic Regression  0.616667   0.379032 0.552941  0.449761 0.654555
            XGBoost  0.710000   0.476190 0.235294  0.314961 0.616142
/tmp/ipykernel_4647/344327062.py:32: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire', ax=axes[0])
Key PCA Factors Driving Default Risk¶

The relative importance of principal components in predicting default was assessed using the Logistic Regression coefficients on the PCA-transformed dataset.

  • High-risk PCA factors (positive coefficients):

    • PC12 (+0.27), PC2 (+0.09), PC17 (+0.09), PC18 (+0.08), PC1 (+0.07)
      These components are positively associated with default risk, indicating that variations captured by these latent factors increase the likelihood of default.
  • Risk-reducing PCA factors (negative coefficients):

    • PC7 (-0.34), PC6 (-0.12), PC4 (-0.11), PC14 (-0.09), PC9 (-0.08)
      These components are associated with lower default risk, suggesting that variations captured by these factors reflect more stable financial behavior or protective patterns against credit default.

8.4. Baseline: Linear Regression on PCA Features¶

Compared to the original feature space, performance decreases significantly. This suggests that compressing the information into 18 components removed useful predictive signals, even though it reduces collinearity.

In [214]:
# 1. Split Data for Regression 
X_train_pca_r, X_test_pca_r, y_train_r, y_test_r = train_test_split(
    X_pca, y_regression, test_size=0.3, random_state=42)

# 2. Implement Baseline Model (Linear Regression) 
lin_reg_pca     = LinearRegression()
lin_reg_pca.fit(X_train_pca_r, y_train_r)

# 3. Evaluate Model 
y_pred_pca_r    = lin_reg_pca.predict(X_test_pca_r)

# Calculate Evaluation Metrics
rmse_pca    = np.sqrt(mean_squared_error(y_test_r, y_pred_pca_r))
r2_pca      = r2_score(y_test_r, y_pred_pca_r)
mae_pca     = np.mean(np.abs(y_test_r - y_pred_pca_r))

print("\n--- REGRESSION Results (Linear Regression on PCA Features) ---")
print(f"Root Mean Squared Error (RMSE - PCA): {rmse_pca:.2f}")
print(f"R-squared (R^2 - PCA): {r2_pca:.4f}")
print(f"Mean Absolute Error (MAE - PCA): {mae_pca:.2f}")
--- REGRESSION Results (Linear Regression on PCA Features) ---
Root Mean Squared Error (RMSE - PCA): 36.87
R-squared (R^2 - PCA): 0.6096
Mean Absolute Error (MAE - PCA): 26.96

Feature Importance for REGRESSION (CREDIT SCORE)¶

In [215]:
# The coefficients show the impact of each PC on the Credit Score value
coefs_r = pd.Series(lin_reg_pca.coef_, index=X_pca.columns).sort_values(ascending=False)
print("\nTop 5 PCA Factors for Credit Score:")
print(coefs_r.head())
Top 5 PCA Factors for Credit Score:
PC7     19.167418
PC6      7.219287
PC4      6.562823
PC14     5.504804
PC13     3.467785
dtype: float64

PCA Factors Driving Credit Score¶

Since Linear Regression is linear in PCA space, each coefficient quantifies how a one-unit increase in a component affects the predicted credit score (holding others constant).
The most influential components were:

PC7 (19.21), PC6 (7.11), PC4 (6.45), PC14 (5.85), PC9 (2.56)

In [216]:
plt.figure(figsize=(10, 6))
coefs_r_sorted  = coefs_r.sort_values(ascending=True)

sns.barplot(x   =coefs_r_sorted.values, y=coefs_r_sorted.index, palette='plasma')

plt.title('Impact of PCA Factors on Credit Score')
plt.xlabel('Coefficient Value (Impact on Credit Score)')
plt.ylabel('Principal Component Factor')
plt.axvline(0, color='red', linestyle='--', linewidth=2) # Baseline at zero

plt.text(coefs_r_sorted.max() * 0.95, -0.5, '↑ Increases Credit Score',
        fontsize=10, color='green', ha='right')
plt.text(coefs_r_sorted.min() * 0.95, -0.5, '↓ Decreases Credit Score',
        fontsize=10, color='blue', ha='left')

plt.tight_layout()
plt.show()
/tmp/ipykernel_4647/519707514.py:4: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x   =coefs_r_sorted.values, y=coefs_r_sorted.index, palette='plasma')

Interpretation Regression Plot (Credit Score)

  • Positive Coeff: These PCs increase predicted CREDIT_SCORE. The dominant effect of PC7 suggests that the latent behavior captured by this component corresponds to financially stable patterns.

  • Negative Coefficients: These PCs are negatively correlated with the CREDIT_SCORE. They decrease the predicted credit score. These components likely represent patterns associated with higher financial stress (e.g., debt pressure or essential spending burden).

8.5. XGBoost Regression on PCA Features¶

In [217]:
# Train XGBoost Regression on PCA with params
xgb_pca_r = XGBRegressor(
    n_estimators    = 100
    , max_depth     = 6
    , learning_rate = 0.1
    , random_state  = 42
    , objective     = 'reg:squarederror'
)

# Fit the model
xgb_pca_r.fit(X_train_pca_r.values, y_train_r.values)

# Make predictions
y_pred_xgb_pca_r = xgb_pca_r.predict(X_test_pca_r.values)

print("XGBoost Regression on PCA data trained successfully!")
XGBoost Regression on PCA data trained successfully!
In [218]:
# Regression Metrics for XGBoost on PCA
mae_xgb_pca_r   = mean_absolute_error(y_test_r, y_pred_xgb_pca_r)
mse_xgb_pca_r   = mean_squared_error(y_test_r, y_pred_xgb_pca_r)
rmse_xgb_pca_r  = np.sqrt(mse_xgb_pca_r)
r2_xgb_pca_r    = r2_score(y_test_r, y_pred_xgb_pca_r)

print("=== XGBoost Regression Metrics (PCA Data) ===")
print(f"Mean Absolute Error (MAE):  {mae_xgb_pca_r:.4f}")
print(f"Mean Squared Error (MSE):   {mse_xgb_pca_r:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb_pca_r:.4f}")
print(f"R² Score:                   {r2_xgb_pca_r:.4f}")
=== XGBoost Regression Metrics (PCA Data) ===
Mean Absolute Error (MAE):  28.3259
Mean Squared Error (MSE):   1424.3450
Root Mean Squared Error (RMSE): 37.7405
R² Score:                   0.5909
In [219]:
# Feature Importance for PCA components (Regression)
importance_xgb_pca_r    = pd.Series(xgb_pca_r.feature_importances_, index=X_train_pca_r.columns)

plt.figure(figsize=(10, 6))
importance_sorted_r     = importance_xgb_pca_r.sort_values(ascending=True)
sns.barplot(x=importance_sorted_r.values, y=importance_sorted_r.index, palette='rocket')
plt.title('XGBoost Feature Importance on PCA Components (Regression)')
plt.xlabel('Feature Importance Score')
plt.ylabel('PCA Components')
plt.show()
/tmp/ipykernel_4647/3556927487.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=importance_sorted_r.values, y=importance_sorted_r.index, palette='rocket')

XGBoost identifies PC7 as the most influential component, consistent with the linear coefficient analysis. This agreement strengthens confidence that PC7 captures a latent factor strongly related to creditworthiness.

In [220]:
# Visualize Predictions vs Actual for XGBoost on PCA
plt.figure(figsize=(10, 6))
plt.scatter(y_test_r, y_pred_xgb_pca_r, alpha=0.6, edgecolors='k')
plt.plot([y_test_r.min(), y_test_r.max()], [y_test_r.min(), y_test_r.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual CREDIT_SCORE')
plt.ylabel('Predicted CREDIT_SCORE')
plt.title('XGBoost on PCA: Predicted vs Actual CREDIT_SCORE')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Residual plot
residuals_xgb_pca = y_test_r - y_pred_xgb_pca_r
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb_pca_r, residuals_xgb_pca, alpha=0.6, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted CREDIT_SCORE')
plt.ylabel('Residuals')
plt.title('XGBoost on PCA: Residual Plot')
plt.grid(alpha=0.3)
plt.show()

Interpretation: Compared to the original-feature setting, predictions show larger dispersion and residuals, which matches the weaker $R^2$ on PCA data.

8.6 Model Comparison: Linear Regression vs XGBoost on PCA (Regression)¶

In [221]:
# Get Linear Regression metrics on PCA
y_pred_linreg_pca   = lin_reg_pca.predict(X_test_pca_r)

mae_linreg_pca      = mean_absolute_error(y_test_r, y_pred_linreg_pca)
mse_linreg_pca      = mean_squared_error(y_test_r, y_pred_linreg_pca)
rmse_linreg_pca     = np.sqrt(mse_linreg_pca)
r2_linreg_pca       = r2_score(y_test_r, y_pred_linreg_pca)

# Create comparison dataframe
comparison_pca_r = pd.DataFrame({
    'Model' : ['Linear Regression', 'XGBoost']
    , 'MAE' : [mae_linreg_pca, mae_xgb_pca_r]
    , 'MSE' : [mse_linreg_pca, mse_xgb_pca_r]
    , 'RMSE'    : [rmse_linreg_pca, rmse_xgb_pca_r]
    , 'R² Score': [r2_linreg_pca, r2_xgb_pca_r]
})

print("=== PCA Regression Model Comparison ===")
print(comparison_pca_r.to_string(index=False))

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Bar plot comparison for error metrics
error_metrics_pca   = comparison_pca_r[['Model', 'MAE', 'RMSE']].set_index('Model')
error_metrics_pca.plot(kind='bar', ax=axes[0], rot=0, color=['skyblue', 'lightcoral'])
axes[0].set_title('Regression Error Metrics Comparison on PCA Data (Lower is Better)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Error Value')
axes[0].legend(loc='upper right')
axes[0].grid(axis='y', alpha=0.3)


r2_comparison_pca   = comparison_pca_r[['Model', 'R² Score']].set_index('Model')
r2_comparison_pca.plot(kind='bar', ax=axes[1], rot=0, color='lightgreen', legend=False)
axes[1].set_title('R² Score Comparison on PCA Data (Higher is Better)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1.0])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Determine best model
best_model_pca_r2   = comparison_pca_r.loc[comparison_pca_r['R² Score'].idxmax(), 'Model']
best_model_pca_mae  = comparison_pca_r.loc[comparison_pca_r['MAE'].idxmin(), 'Model']

print(f"\nBest model by R² Score: {best_model_pca_r2}")
print(f"Best model by MAE: {best_model_pca_mae}")
=== PCA Regression Model Comparison ===
            Model       MAE         MSE      RMSE  R² Score
Linear Regression 26.963196 1359.150316 36.866656  0.609584
          XGBoost 28.325929 1424.344999 37.740495  0.590856
Best model by R² Score: Linear Regression
Best model by MAE: Linear Regression

Interpretation: XGBoost achieves slightly higher $R^2$, suggesting it models weak non-linearities between PCA components. However, the difference is small and both PCA-based models remain significantly worse than models trained on the full feature space.


9. Imbalance Handling techniques¶

In [222]:
df.shape
Out[222]:
(1000, 87)

Data Preprocessing and Applying SMOTE Technique¶

In [223]:
# 1. Define Features (X) and Targets (y)
y_classification = df['DEFAULT']
X_raw = df.drop(columns=['DEFAULT', 'CREDIT_SCORE', 'CREDIT_CATEGORY'])

# 2. Split Data for Classification
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_raw, y_classification, test_size=0.3, random_state=42, stratify=y_classification
)

# 3. Scaling the dataset
# We use StandardScaler because financial data often has outliers
scaler = StandardScaler()

# Fit on training data ONLY
X_train_c_scaled = scaler.fit_transform(X_train_c)

# Transform test data using the training fit
X_test_c_scaled = scaler.transform(X_test_c)

# 4. Handle Imbalance with SMOTE
# SMOTE is applied ONLY to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_c_scaled, y_train_c)

print(f"Original training shape: {y_train_c.value_counts().to_dict()}")
print(f"Resampled training shape: {pd.Series(y_train_resampled).value_counts().to_dict()}")
Original training shape: {0: 501, 1: 199}
Resampled training shape: {1: 501, 0: 501}

9.1. Logistic Regression¶

In [224]:
# 1. Initialize Logistic Regression
log_smote = LogisticRegression(random_state=42, solver='liblinear')
log_smote.fit(X_train_resampled, y_train_resampled)

# 2. Evaluate Model using probabilities
y_pred_prob_smote = log_smote.predict_proba(X_test_c_scaled)[:, 1]

# 3. Calculate AUC-ROC
auc_smote = roc_auc_score(y_test_c, y_pred_prob_smote)
print("\n--- Classification Results (Logistic Regression + SMOTE) ---")
print(f"AUC-ROC Score on Test Set: {auc_smote:.4f}")

# 4. Generate Classification report
y_pred_smote = log_smote.predict(X_test_c_scaled)
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_smote))
--- Classification Results (Logistic Regression + SMOTE) ---
AUC-ROC Score on Test Set: 0.6260

Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.70      0.74       215
           1       0.40      0.51      0.45        85

    accuracy                           0.65       300
   macro avg       0.59      0.60      0.59       300
weighted avg       0.67      0.65      0.66       300

9.2. Random Forest with SMOTE¶

In [225]:
# Train Random Forest 
rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_resampled, y_train_resampled)

# Y pred
y_rf = rf_smote.predict(X_test_c_scaled)
y_pred_probrf = rf_smote.predict_proba(X_test_c_scaled)[:, 1]

# Accuray
acc = accuracy_score(y_test_c, y_rf)
print(f"Random Forest Accuracy: {accuracy:.4f}")

# AUC-ROC Score 
auc_smote         = roc_auc_score(y_test_c, y_pred_probrf)
print(f"Random Forest AUC-ROC Score: {auc_smote:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_c, y_rf))
Random Forest Accuracy: 0.7067
Random Forest AUC-ROC Score: 0.6206

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.84      0.79       215
           1       0.40      0.27      0.32        85

    accuracy                           0.68       300
   macro avg       0.57      0.55      0.55       300
weighted avg       0.65      0.68      0.66       300

10. Overall Conclusion¶

This analysis evaluated classification (DEFAULT prediction) and regression (CREDIT_SCORE prediction) tasks using both original features (104 dimensions including 23 engineered features) and PCA-reduced features (18 dimensions, 90.4% variance). The study also assessed the effectiveness of three dimensionality reduction techniques: PCA, CCA, and FCA.

10.1 Classification Performance Analysis (DEFAULT Prediction)¶

Original Features vs PCA Features¶

Best Model on Original Features: Logistic Regression

  • Accuracy: 66.7%
  • ROC-AUC: 0.626
  • Precision: 42.7%, Recall: 51.8%, F1-Score: 46.8%
  • Strength: High F1-Score , good precision & recall
  • Weakness: low recall (missed 48% of actual defaulters)

Best Model on PCA Features: Logistic Regression

  • Accuracy: 61.67%
  • ROC-AUC: 0.653 (highest among all models)
  • Precision: 37.10%, Recall: 54.1%, F1-Score: 44%
  • Strength: Best discriminative ability, better recall balance
  • Weakness: Lower overall accuracy

Key Findings¶

  • PCA improves model discrimination: Applying PCA increased the ROC-AUC for Logistic Regression, indicating better ability to distinguish defaulters from non-defaulters.
  • Accuracy decreases with PCA: Overall accuracy dropped by 4.5–10.8 percentage points across models, showing a trade-off between predictive accuracy and class separation.
  • Trade-off between recall and accuracy: PCA helps detect more defaulters (higher recall) but reduces overall correct predictions.
  • Moderate predictive power: All models exhibit moderate ROC-AUC values (0.60–0.65), suggesting significant overlap between default and non-default classes.
  • Potential for feature improvement: The current feature set may be insufficient to fully separate classes; additional or engineered features could enhance predictive performance.

10.2. Regression Performance Analysis (CREDIT_SCORE Prediction)¶

Original Features vs PCA Features¶

Best Model on Original Features: Random Forest

  • R²: 0.7735 (explains 77.35% of variance)
  • MAE: 20.59 points
  • RMSE: 28.08 points

Best Model on PCA Features: XGBoost

  • R²: 0.6113 (explains 61.13% of variance)
  • MAE: 27.37 points
  • RMSE: 36.78 points

Key Findings¶

  • PCA caused significant performance degradation:

    • R² dropped by 15–19 percentage points (0.76 → 0.61)
    • MAE increased by 26.2% (21.25 → 26.96)
    • RMSE increased by 28.7% (28.63 → 36.87)
  • Loss of predictive information: The 18 principal components failed to preserve crucial relationships between features.

  • Linear structure of credit scoring: Linear Regression achieved R² = 0.7553, nearly matching Random Forest, suggesting that credit scoring is largely a linear problem when features are properly engineered.

  • Original features essential: The full 104-dimensional feature space, including engineered variables, is necessary for high-quality predictions.


10.3. Effectiveness of Dimensionality Reduction Techniques¶

1. PCA (Principal Component Analysis):¶

Configuration:

  • 18 components capturing 90.4% variance
  • 82.7% reduction in dimensionality (104 → 18 features)

Advantages:

  • Successfully compressed features while retaining 90% variance
  • Improved ROC-AUC for classification (+0.83 points for Logistic Regression)
  • Improve interpretability through component loadings
  • Better minority class recall (55.29% vs 52%)

Disadvantages:

  • Severe regression performance loss: R² dropped from 0.76 to 0.61 (-19.7%)
  • Reduced classification accuracy: Dropped 4.5-10.8 percentage points
  • Variance ≠ Predictive Power: 90% variance retention doesn't guarantee discriminative information

Verdict:

  • For Classification: Marginal benefit if minority class detection is critical
  • For Regression: Not recommended - substantial performance degradation
  • Best use case: Exploratory analysis, visualization, understanding data structure

2. CCA (Canonical Correlation Analysis):¶

Analysis: Financial Status (INCOME, SAVINGS, DEBT, ratios) vs. Spending Behavior (expenditure categories)

Results:

  • First canonical correlation: $\rho_{1}$ = 0.9914 (99.14% correlation!)
  • Second canonical correlation: $\rho_{2}$ = 0.7538 (75.38% correlation)

Insights Gained:

  • Revealed fundamental structure: Financial status and spending behavior are almost perfectly coupled
  • Validated feature engineering: Confirms spending patterns are highly predictable from financial metrics
  • Explained model behavior: The tight coupling explains why financial ratios (R_DEBT_INCOME, R_SAVINGS_INCOME) dominate importance rankings
  • Identified multicollinearity: 99% correlation suggests redundancy between financial and behavioral features

Practical Implications:

  • Can safely reduce either financial OR behavioral features without major information loss
  • Simpler models may suffice given the near-linear relationship
  • Customer spending is highly constrained by financial capacity

Verdict: Excellent technique for understanding multivariate relationships; provides actionable insights for feature engineering and model interpretation

3. FCA/Correspondence Analysis: EXCELLENT FOR CATEGORICAL INSIGHTS¶

Analysis 1: CREDIT_CATEGORY vs DEFAULT

Category FCA Coordinate Interpretation
Excellent -0.6298 Strongly aligned with non-default
Good -0.2184 Slightly aligned with non-default
Fair +0.0873 Slightly leaning toward risk
Poor +0.8486 Strongly aligned with default
  • Clear monotonic gradient: Excellent → Poor shows 1.48 units separation
  • Validates categorization: CREDIT_CATEGORY is a strong proxy for default risk

Analysis 2: CAT_GAMBLING vs DEFAULT

Gambling Level FCA Coordinate Interpretation
No (0) -0.0361 Aligned with non-default
Low (1) +0.0011 Neutral
High (2) +0.0842 Aligned with default risk
  • Positive risk gradient: Gambling intensity positively associated with default

Insights Gained:

  • Validated categorical features: Discretized variables capture meaningful risk gradients
  • Identified non-linear effects: Categorical binning revealed relationships continuous features might miss
  • Visual interpretability: 1D FCA plots provide clear, stakeholder-friendly visualizations
  • Feature importance confirmation: Both variables emerged as important in Random Forest models

Verdict: Highly effective for exploratory analysis of categorical variables; provides clear, interpretable insights that complement quantitative models